Abstract

We propose a new primal-dual algorithmic framework for a prototypical constrained convex
optimization template. The algorithmic instances of our framework are universal since they can
automatically adapt to the unknown Hölder continuity degree and constant within the dual
formulation. They are also guaranteed to have optimal convergence rates in the objective residual
and the feasibility gap for each Hölder smoothness degree. In contrast to existing primal-dual
algorithms, our framework avoids the proximity operator of the objective function. We instead
leverage computationally cheaper, Fenchel-type operators, which are the main workhorses of the
generalized conditional gradient (GCG)-type methods. In contrast to the GCG-type methods, our
framework does not require the objective function to be differentiable, and can also process
additional general linear inclusion constraints, while guaranteeing the convergence rate on the
primal problem.


1 Introduction 

This paper constructs an algorithmic framework for the following convex optimization template: 

$$f^\star := \min_{x \in \mathcal{X}} \left\{ f(x) : Ax - b \in \mathcal{K} \right\}, \tag{1}$$

where $f : \mathbb{R}^p \to \mathbb{R} \cup \{+\infty\}$ is a convex function, $A \in \mathbb{R}^{n \times p}$, $b \in \mathbb{R}^n$, and $\mathcal{X}$ and $\mathcal{K}$ are nonempty,
closed and convex sets in $\mathbb{R}^p$ and $\mathbb{R}^n$ respectively. The constrained optimization formulation (1) is
quite flexible, capturing many important learning problems in a unified fashion, including matrix
completion, sparse regularization, support vector machines, and submodular optimization.

Processing the inclusion $Ax - b \in \mathcal{K}$ in (1) requires a significant computational effort in the
large-scale setting [4]. Hence, the majority of the scalable numerical solution methods for (1) are
of the primal-dual type, including decomposition, augmented Lagrangian, and alternating direction
methods. The efficiency guarantees of these methods mainly depend on three properties
of $f$: Lipschitz gradient, strong convexity, and the tractability of its proximal operator. For instance,
the proximal operator of $f$, i.e., $\mathrm{prox}_f(x) := \arg\min_z \{ f(z) + (1/2)\|z - x\|^2 \}$, is key in handling
non-smooth $f$ while obtaining the convergence rates as if it had Lipschitz gradient.
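
To make the proximal operator concrete, the following minimal sketch evaluates it for one standard non-smooth choice, $f = \tau \|\cdot\|_1$, where it reduces to entrywise soft-thresholding. The function name `prox_l1` and the choice of $f$ are illustrative assumptions and are not taken from the paper.

```python
def prox_l1(x, tau):
    """Proximal operator of f(z) = tau * ||z||_1 at x.

    Solves argmin_z { tau * ||z||_1 + 0.5 * ||z - x||^2 } entrywise,
    which is the classical soft-thresholding rule.
    """
    out = []
    for xi in x:
        shrunk = max(abs(xi) - tau, 0.0)   # shrink the magnitude by tau
        out.append(shrunk if xi >= 0 else -shrunk)
    return out
```

For example, `prox_l1([3.0, -0.5, 1.0], 1.0)` shrinks the first entry to 2 and sets the other two to zero.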

When the inclusion $Ax - b \in \mathcal{K}$ is absent in (1), other methods can be preferable to primal-dual algorithms.
For instance, if $f$ has Lipschitz gradient, then we can use the accelerated proximal gradient methods
by applying the proximal operator for the indicator function of the set $\mathcal{X}$. However, as the
problem dimensions become increasingly larger, the proximal tractability assumption can be
restrictive. This fact increased the popularity of the generalized conditional gradient (GCG) methods (or
Frank-Wolfe-type algorithms), which instead leverage the following Fenchel-type oracles [1, 12, 13]:

$$[x]^\sharp_{\mathcal{X}, g} := \arg\max_{s \in \mathcal{X}} \left\{ \langle x, s \rangle - g(s) \right\}, \tag{2}$$

where $g$ is a convex function. When $g = 0$, we obtain the so-called linear minimization oracle [12].
When $\mathcal{X} \equiv \mathbb{R}^p$, then the (sub)gradient of the Fenchel conjugate of $g$, $\nabla g^*$, is in the set $[x]^\sharp_g$.
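
As an illustration of the oracle in (2) with $g = 0$, the sketch below evaluates the sharp operator over two simple sets where the maximizer is a vertex. The sets (probability simplex and $\ell_1$-ball) and the function names are assumptions chosen for illustration only.

```python
def sharp_simplex(x):
    """[x]^sharp over the probability simplex with g = 0.

    A linear function is maximized at a vertex, so the answer is the
    unit vector e_i with i = argmax_i x_i (first index on ties).
    """
    i_best = max(range(len(x)), key=lambda i: x[i])
    return [1.0 if i == i_best else 0.0 for i in range(len(x))]


def sharp_l1_ball(x, kappa):
    """[x]^sharp over {s : ||s||_1 <= kappa} with g = 0.

    The maximizer puts all mass kappa on the entry of largest
    magnitude, with the matching sign.
    """
    i_best = max(range(len(x)), key=lambda i: abs(x[i]))
    sign = 1.0 if x[i_best] >= 0 else -1.0
    return [kappa * sign if i == i_best else 0.0 for i in range(len(x))]
```

Both oracles cost a single pass over the input, in contrast to a projection onto the same sets.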

The sharp-operator in (2) is often much cheaper to process as compared to the prox operator [1, 12].
While the GCG-type algorithms require $O(1/\epsilon)$ iterations to guarantee an $\epsilon$-primal objective
residual/duality gap, they cannot converge when their objective is nonsmooth.

To this end, we propose a new primal-dual algorithmic framework that can exploit the sharp-operator
of $f$ in lieu of its proximal operator. Our aim is to combine the flexibility of proximal primal-dual
methods in addressing the general template (1) while leveraging the computational advantages of
the GCG-type methods. As a result, we trade off the computational difficulty per iteration with the
overall rate of convergence. While we obtain optimal rates based on the sharp-operator oracles,
we note that the rates reduce to $O(1/\epsilon^2)$ with the sharp operator vs. $O(1/\epsilon)$ with the proximal
operator when $f$ is completely non-smooth (cf. Definition 1.1). Intriguingly, the convergence rates
are the same when $f$ is strongly convex. Unlike GCG-type methods, our approach can now handle
nonsmooth objectives in addition to complex constraint structures as in (1).

Our primal-dual framework is universal in the sense that the convergence of our algorithms can optimally
adapt to the Hölder continuity of the dual objective $g$ (cf. (6) in Section 3) without having to know
its parameters. By Hölder continuity, we mean the (sub)gradient $\nabla g$ of a convex function $g$ satisfies
$\|\nabla g(\lambda) - \nabla g(\tilde{\lambda})\| \le M_\nu \|\lambda - \tilde{\lambda}\|^\nu$ with parameters $M_\nu < \infty$ and $\nu \in [0, 1]$ for all $\lambda, \tilde{\lambda} \in \mathbb{R}^n$.
The case $\nu = 0$ models the bounded subgradient, whereas $\nu = 1$ captures the Lipschitz gradient.
Hölder continuity has recently resurfaced in unconstrained optimization in [15] with universal
gradient methods that obtain optimal rates without having to know $M_\nu$ and $\nu$. Unfortunately, these
methods cannot directly handle the general constrained template (1). After our initial draft appeared,
[14] presented new GCG-type methods for composite minimization, i.e., $\min_{x \in \mathbb{R}^p} f(x) + \psi(x)$,
relying on Hölder smoothness of $f$ (i.e., $\nu \in (0, 1]$) and the sharp-operator of $\psi$. The methods
in [14] do not apply when $f$ is non-smooth. In addition, they cannot process the additional inclusion
$Ax - b \in \mathcal{K}$ in (1), which is a major drawback for machine learning applications.

Our algorithmic framework features a gradient method and its accelerated variant that operate on
the dual formulation of (1). For the accelerated variant, we study an alternative to the universal
accelerated method of [15] based on FISTA [10] since it requires fewer proximal operators in the
dual. While the FISTA scheme is classical, our analysis of it with the Hölder continuity assumption
is new. Given the dual iterates, we then use a new averaging scheme to construct the primal iterates
for the constrained template (1). In contrast to the non-adaptive weighting schemes of GCG-type
algorithms, our weights explicitly depend on the local estimates of the Hölder constants $M_\nu$ at each
iteration. Finally, we derive the worst-case complexity results. Our results are optimal since they
match the computational lower bounds in the sense of first-order black-box methods.

Paper organization: Section 2 briefly recalls the primal-dual formulation of problem (1) with some
standard assumptions. Section 3 defines the universal gradient mapping and its properties. Section 4
presents the primal-dual universal gradient methods (both the standard and accelerated variants), and
analyzes their convergence. Section 5 provides numerical illustrations, followed by our conclusions.
The supplementary material includes the technical proofs and additional implementation details.

Notation and terminology: For notational simplicity, we work on the $\mathbb{R}^p$/$\mathbb{R}^n$ spaces with the
Euclidean norms. We denote the Euclidean distance of the vector $u$ to a closed convex set $\mathcal{X}$ by
$\mathrm{dist}(u, \mathcal{X})$. Throughout the paper, $\|\cdot\|$ represents the Euclidean norm for vectors and the spectral
norm for matrices. For a convex function $f$, we use $\nabla f$ both for its subgradient and gradient, and
$f^*$ for its Fenchel conjugate. Our goal is to approximately solve (1) to obtain $x_\epsilon$ in the following
sense:

Definition 1.1. Given an accuracy level $\epsilon > 0$, a point $x_\epsilon \in \mathcal{X}$ is said to be an $\epsilon$-solution of (1) if
$|f(x_\epsilon) - f^\star| \le \epsilon$ and $\mathrm{dist}(Ax_\epsilon - b, \mathcal{K}) \le \epsilon$.

Here, we call $|f(x_\epsilon) - f^\star|$ the primal objective residual and $\mathrm{dist}(Ax_\epsilon - b, \mathcal{K})$ the feasibility gap.

2 Primal-dual preliminaries 

In this section, we briefly summarise the primal-dual formulation with some standard assumptions. 
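For the ease of presentation, we reformulate (1) by introducing a slack variable $r$ as follows:

$$f^\star = \min_{x \in \mathcal{X}, r \in \mathcal{K}} \left\{ f(x) : Ax - r = b \right\}, \quad (x^\star : f(x^\star) = f^\star). \tag{3}$$

Let $z := [x, r]$ and $\mathcal{Z} := \mathcal{X} \times \mathcal{K}$. Then, we have $\mathcal{D} := \{ z \in \mathcal{Z} : Ax - r = b \}$ as the feasible set of (3).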
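
The dual problem: The Lagrange function associated with the linear constraint $Ax - r = b$ is
defined as $\mathcal{L}(x, r, \lambda) := f(x) + \langle \lambda, Ax - r - b \rangle$, and the dual function $d$ of (3) can be defined and
decomposed as follows:

$$d(\lambda) := \min_{x \in \mathcal{X}, r \in \mathcal{K}} \left\{ f(x) + \langle \lambda, Ax - r - b \rangle \right\} = \underbrace{\min_{x \in \mathcal{X}} \left\{ f(x) + \langle \lambda, Ax - b \rangle \right\}}_{d_x(\lambda)} + \underbrace{\min_{r \in \mathcal{K}} \langle \lambda, -r \rangle}_{d_r(\lambda)},$$

where $\lambda \in \mathbb{R}^n$ is the dual variable. Then, we define the dual problem of (1) as follows:

$$d^\star := \max_{\lambda \in \mathbb{R}^n} d(\lambda) = \max_{\lambda \in \mathbb{R}^n} \left\{ d_x(\lambda) + d_r(\lambda) \right\}. \tag{4}$$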

Fundamental assumptions: To characterize the primal-dual relation between (1) and (4), we
require the following assumption:

Assumption A.1. The function $f$ is proper, closed, and convex, but not necessarily smooth. The
constraint sets $\mathcal{X}$ and $\mathcal{K}$ are nonempty, closed, and convex. The solution set $\mathcal{X}^\star$ of (1) is nonempty.
Either $\mathcal{Z}$ is polyhedral or Slater's condition holds. By Slater's condition, we mean
$\mathrm{ri}(\mathcal{Z}) \cap \{(x, r) : Ax - r = b\} \neq \emptyset$, where $\mathrm{ri}(\mathcal{Z})$ stands for the relative interior of $\mathcal{Z}$.

Strong duality: Under Assumption A.1, the solution set $\Lambda^\star$ of the dual problem (4) is also
nonempty and bounded. Moreover, strong duality holds, i.e., $f^\star = d^\star$.


3 Universal gradient mappings 

This section defines the universal gradient mapping and its properties. 

3.1 Dual reformulation 

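We first adopt the composite convex minimization formulation [11] of (4) in convex optimization
for better interpretability as

$$G^\star := \min_{\lambda \in \mathbb{R}^n} \left\{ G(\lambda) := g(\lambda) + h(\lambda) \right\}, \tag{5}$$

where $G^\star = -d^\star$, and the correspondence between $(g, h)$ and $(d_x, d_r)$ is as follows:

$$\begin{cases} g(\lambda) := \max_{x \in \mathcal{X}} \left\{ \langle \lambda, b - Ax \rangle - f(x) \right\} = -d_x(\lambda), \\ h(\lambda) := \max_{r \in \mathcal{K}} \langle \lambda, r \rangle = -d_r(\lambda). \end{cases} \tag{6}$$

Since $g$ and $h$ are generally non-smooth, FISTA and its proximal-based analysis [10] are not directly
applicable. Recall the sharp operator defined in (2); then $g$ can be expressed as

$$g(\lambda) = \max_{x \in \mathcal{X}} \left\{ \langle -A^T \lambda, x \rangle - f(x) \right\} + \langle \lambda, b \rangle,$$

and we define the optimal solution to the $g$ subproblem above as follows:

$$x^*(\lambda) \in \arg\max_{x \in \mathcal{X}} \left\{ \langle -A^T \lambda, x \rangle - f(x) \right\} = [-A^T \lambda]^\sharp_{\mathcal{X}, f}. \tag{7}$$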

The second term, $h$, depends on the structure of $\mathcal{K}$. We consider three special cases:

(a) Sparsity/low-rankness: If $\mathcal{K} := \{ r \in \mathbb{R}^n : \|r\| \le \kappa \}$ for a given $\kappa \ge 0$ and a given norm $\|\cdot\|$,
then $h(\lambda) = \kappa \|\lambda\|_*$, the scaled dual norm of $\|\cdot\|$. For instance, if $\mathcal{K} := \{ r \in \mathbb{R}^n : \|r\|_1 \le \kappa \}$,
then $h(\lambda) = \kappa \|\lambda\|_\infty$. While the $\ell_1$-norm induces the sparsity of $x$, computing $h$ requires the max
absolute elements of $\lambda$. If $\mathcal{K} := \{ r \in \mathbb{R}^{q_1 \times q_2} : \|r\|_* \le \kappa \}$ (the nuclear norm), then $h(\lambda) = \kappa \|\lambda\|$,
the spectral norm. The nuclear norm induces the low-rankness of $x$. Computing $h$ in this case leads
to finding the top-eigenvalue of $\lambda$, which is efficient.

(b) Cone constraints: If $\mathcal{K}$ is a cone, then $h$ becomes the indicator function $\delta_{\mathcal{K}^*}$ of its dual cone
$\mathcal{K}^*$. Hence, we can handle the inequality constraints and positive semidefinite constraints in (1). For
instance, if $\mathcal{K} \equiv \mathbb{R}^n_+$, then $h(\lambda) = \delta_{\mathbb{R}^n_-}(\lambda)$, the indicator function of $\mathbb{R}^n_- := \{ \lambda \in \mathbb{R}^n : \lambda \le 0 \}$. If
$\mathcal{K} \equiv \mathcal{S}^p_+$, then $h(\lambda) := \delta_{\mathcal{S}^p_-}(\lambda)$, the indicator function of the negative semidefinite matrix cone.

(c) Separable structures: If $\mathcal{X}$ and $f$ are separable, i.e., $\mathcal{X} := \prod_{i=1}^p \mathcal{X}_i$ and $f(x) := \sum_{i=1}^p f_i(x_i)$,
then the evaluation of $g$ and its derivatives can be decomposed into $p$ subproblems.
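
The closed forms in cases (a) and (b) can be sketched for vectors as follows. This is only an illustration under the assumption that $\lambda$ is stored as a plain Python list; the function names are hypothetical.

```python
import math


def h_l1_ball(lam, kappa):
    """h for K = {r : ||r||_1 <= kappa}: the scaled dual norm kappa * ||lam||_inf."""
    return kappa * max(abs(v) for v in lam)


def h_nonneg_orthant(lam):
    """h for K = R^n_+: indicator of {lam <= 0}, i.e. 0 inside and +inf outside."""
    return 0.0 if all(v <= 0 for v in lam) else math.inf


def prox_h_nonneg_orthant(lam):
    """prox of that indicator: Euclidean projection onto {lam <= 0}.

    The projection does not depend on the step size, so none is passed.
    """
    return [min(v, 0.0) for v in lam]
```

In the cone case the proximal step on $h$ is therefore a simple entrywise clipping.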

3.2 Hölder continuity of the dual universal gradient

Let $\nabla g(\cdot)$ be a subgradient of $g$, which can be computed as $\nabla g(\lambda) = b - Ax^*(\lambda)$. Next, we define

$$M_\nu = M_\nu(g) := \sup_{\lambda, \tilde{\lambda} \in \mathbb{R}^n, \lambda \neq \tilde{\lambda}} \left\{ \frac{\|\nabla g(\lambda) - \nabla g(\tilde{\lambda})\|}{\|\lambda - \tilde{\lambda}\|^\nu} \right\}, \tag{8}$$

where $\nu \ge 0$ is the Hölder smoothness order. Note that the parameter $M_\nu$ explicitly depends on
$\nu$ [15]. We are interested in the case $\nu \in [0, 1]$, and especially the two extremal cases, where
we either have the Lipschitz gradient that corresponds to $\nu = 1$, or the bounded subgradient that
corresponds to $\nu = 0$.


We require the following condition in the sequel:

Assumption A.2. $M(g) := \inf_{0 \le \nu \le 1} M_\nu(g) < +\infty$.

Assumption A.2 is reasonable. We explain this claim with the following two examples. First, if $g$ is
subdifferentiable and $\mathcal{X}$ is bounded, then $\nabla g(\cdot)$ is also bounded. Indeed, we have

$$\|\nabla g(\lambda)\| = \|b - Ax^*(\lambda)\| \le D^A_{\mathcal{X}} := \sup\{ \|b - Ax\| : x \in \mathcal{X} \}.$$

Hence, we can choose $\nu = 0$ and $M_\nu(g) = 2D^A_{\mathcal{X}} < \infty$.

Second, if $f$ is uniformly convex with the convexity parameter $\mu_f > 0$ and the degree $q \ge 2$, i.e.,
$\langle \nabla f(x) - \nabla f(\tilde{x}), x - \tilde{x} \rangle \ge \mu_f \|x - \tilde{x}\|^q$ for all $x, \tilde{x} \in \mathbb{R}^p$, then $g$ defined by (6) satisfies (8) with
$\nu = \frac{1}{q-1}$ and $M_\nu(g) = \left( \mu_f^{-1} \|A\|^2 \right)^{\frac{1}{q-1}} < +\infty$, as shown in [15]. In particular, if $q = 2$, i.e., $f$
is $\mu_f$-strongly convex, then $\nu = 1$ and $M_\nu(g) = \mu_f^{-1} \|A\|^2$, which is the Lipschitz constant of the
gradient $\nabla g$.


3.3 The proximal-gradient step for the dual problem 

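Given $\hat{\lambda}_k \in \mathbb{R}^n$ and $M_k > 0$, we define

$$Q_{M_k}(\lambda; \hat{\lambda}_k) := g(\hat{\lambda}_k) + \langle \nabla g(\hat{\lambda}_k), \lambda - \hat{\lambda}_k \rangle + \frac{M_k}{2} \|\lambda - \hat{\lambda}_k\|^2$$

as an approximate quadratic surrogate of $g$. Then, we consider the following update rule:

$$\lambda_{k+1} := \arg\min_{\lambda \in \mathbb{R}^n} \left\{ Q_{M_k}(\lambda; \hat{\lambda}_k) + h(\lambda) \right\} \equiv \mathrm{prox}_{M_k^{-1} h}\left( \hat{\lambda}_k - M_k^{-1} \nabla g(\hat{\lambda}_k) \right). \tag{9}$$

For a given accuracy $\epsilon > 0$, we define

$$\bar{M}_\epsilon := \left[ \frac{1 - \nu}{1 + \nu} \frac{1}{\epsilon} \right]^{\frac{1 - \nu}{1 + \nu}} M_\nu^{\frac{2}{1 + \nu}}. \tag{10}$$

We need to choose the parameter $M_k > 0$ such that $Q_{M_k}$ is an approximate upper surrogate of $g$,
i.e., $g(\lambda) \le Q_{M_k}(\lambda; \hat{\lambda}_k) + \delta_k$ for some $\lambda \in \mathbb{R}^n$ and $\delta_k \ge 0$. If $\nu$ and $M_\nu$ are known, then we can
set $M_k = \bar{M}_\epsilon$ defined by (10). In this case, $Q_{\bar{M}_\epsilon}$ is an upper surrogate of $g$. In general, we do not
know $\nu$ and $M_\nu$. Hence, $M_k$ can be determined via a backtracking line-search procedure.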
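
As a small numerical companion to the constant in (10), the sketch below evaluates $[\frac{1-\nu}{1+\nu}\frac{1}{\epsilon}]^{\frac{1-\nu}{1+\nu}} M_\nu^{\frac{2}{1+\nu}}$ for given $\nu$, $M_\nu$ and $\epsilon$. The function name `m_bar` is hypothetical, and the handling of the boundary case $\nu = 1$ relies on Python evaluating `0.0 ** 0.0` as `1.0`.

```python
def m_bar(eps, nu, m_nu):
    """Evaluate the constant in (10) for accuracy eps, Holder order nu, constant m_nu."""
    if not (0.0 <= nu <= 1.0) or eps <= 0.0 or m_nu <= 0.0:
        raise ValueError("need 0 <= nu <= 1, eps > 0 and m_nu > 0")
    exponent = (1.0 - nu) / (1.0 + nu)
    # For nu = 1 both base and exponent are 0; Python returns 0.0 ** 0.0 == 1.0,
    # so the constant reduces to the Lipschitz constant m_nu.
    return ((exponent / eps) ** exponent) * m_nu ** (2.0 / (1.0 + nu))
```

With $\nu = 1$ the value is independent of $\epsilon$, whereas with $\nu = 0$ it scales as $M_0^2 / \epsilon$, which is the source of the slower rate in the fully non-smooth case.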


4 Universal primal-dual gradient methods 


We apply the universal gradient mappings to the dual problem (5), and propose an averaging scheme
to construct $\{\bar{x}_k\}$ for approximating $x^\star$. Then, we develop an accelerated variant based on the FISTA
scheme [10], and construct another primal sequence $\{\bar{x}_k\}$ for approximating $x^\star$.


4.1 Universal primal-dual gradient algorithm 


Our algorithm is shown in Algorithm 1. The dual steps are simply the universal gradient method
in [15], while the new primal step allows us to approximate the solution of (1).

Complexity-per-iteration: First, computing $x^*(\lambda_k)$ at Step 1 requires the solution
$x^*(\lambda_k) \in [-A^T \lambda_k]^\sharp_{\mathcal{X}, f}$. For many $\mathcal{X}$ and $f$, we can compute $x^*(\lambda_k)$ efficiently and often in a closed form.

Algorithm 1 (Universal Primal-Dual Gradient Method (UniPDGrad))

Initialization: Choose an initial point $\lambda_0 \in \mathbb{R}^n$ and a desired accuracy level $\epsilon > 0$.
Estimate a value $M_{-1}$ such that $0 < M_{-1} \le \bar{M}_\epsilon$. Set $S_{-1} = 0$ and $\bar{x}_{-1} = 0^p$.
for $k = 0$ to $k_{\max}$
  1. Compute a primal solution $x^*(\lambda_k) \in [-A^T \lambda_k]^\sharp_{\mathcal{X}, f}$.
  2. Form $\nabla g(\lambda_k) = b - Ax^*(\lambda_k)$.
  3. Line-search: Set $M_{k,0} = 0.5 M_{k-1}$. For $i = 0$ to $i_{\max}$, perform the following steps:
    3.a. Compute the trial point $\lambda_{k,i} = \mathrm{prox}_{M_{k,i}^{-1} h}\left( \lambda_k - M_{k,i}^{-1} \nabla g(\lambda_k) \right)$.
    3.b. If the following line-search condition holds:
           $g(\lambda_{k,i}) \le Q_{M_{k,i}}(\lambda_{k,i}; \lambda_k) + \epsilon/2$,
         then set $i_k = i$ and terminate the line-search loop. Otherwise, set $M_{k,i+1} = 2M_{k,i}$.
     End of line-search
  4. Set $\lambda_{k+1} = \lambda_{k,i_k}$ and $M_k = M_{k,i_k}$. Compute $w_k = 1/M_k$, $S_k = S_{k-1} + w_k$, and $\gamma_k = w_k/S_k$.
  5. Compute $\bar{x}_k = (1 - \gamma_k)\bar{x}_{k-1} + \gamma_k x^*(\lambda_k)$.
end for
Output: Return the primal approximation $\bar{x}_k$ for $x^\star$.
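
The steps of Algorithm 1 can be sketched on a toy instance. The sketch below is not the authors' implementation: it assumes $f(x) = \frac{1}{2}\|x - c\|^2$, $\mathcal{X} = \mathbb{R}^p$ and $\mathcal{K} = \{0\}$ (an equality constraint), so that $h = 0$, its prox is the identity, and the sharp operator has the closed form $x^*(\lambda) = c - A^T\lambda$. The weights $w_k = 1/M_k$ and all function and parameter names are assumptions of this illustration.

```python
def unipdgrad_equality(A, b, c, eps=1e-3, m_init=1.0, iters=40, max_ls=60):
    """Sketch of Algorithm 1 for: min 0.5 * ||x - c||^2  subject to  A x = b."""
    n, p = len(A), len(c)

    def matvec(x):                       # A x
        return [sum(A[i][j] * x[j] for j in range(p)) for i in range(n)]

    def x_star(lam):                     # sharp operator: c - A^T lam
        return [c[j] - sum(A[i][j] * lam[i] for i in range(n)) for j in range(p)]

    def dual_oracle(lam):                # returns g(lam), grad g(lam), x*(lam)
        x = x_star(lam)
        ax = matvec(x)
        grad = [b[i] - ax[i] for i in range(n)]
        fx = 0.5 * sum((x[j] - c[j]) ** 2 for j in range(p))
        return sum(lam[i] * grad[i] for i in range(n)) - fx, grad, x

    lam = [0.0] * n
    m_prev, s_sum, x_bar = m_init, 0.0, [0.0] * p
    for _ in range(iters):
        g_k, grad, x_k = dual_oracle(lam)            # Steps 1 and 2
        m = 0.5 * m_prev                             # Step 3: start of line-search
        for _ls in range(max_ls):
            # Step 3.a: prox of h = 0 is the identity, so this is a gradient step.
            trial = [lam[i] - grad[i] / m for i in range(n)]
            d = [trial[i] - lam[i] for i in range(n)]
            q = (g_k + sum(grad[i] * d[i] for i in range(n))
                 + 0.5 * m * sum(v * v for v in d))
            if dual_oracle(trial)[0] <= q + eps / 2:  # Step 3.b
                break
            m *= 2.0
        lam, m_prev = trial, m                       # Step 4
        w = 1.0 / m
        s_sum += w
        gamma = w / s_sum
        # Step 5: weighted running average of the primal points x*(lam_k).
        x_bar = [(1 - gamma) * x_bar[j] + gamma * x_k[j] for j in range(p)]
    return x_bar, lam
```

For instance, `unipdgrad_equality([[1.0, 1.0]], [1.0], [0.0, 0.0])` approaches the minimum-norm point on the line $x_1 + x_2 = 1$. In this toy problem the estimate $M_k$ keeps halving once the dual gradient vanishes, so the iteration count is kept modest to avoid floating-point underflow.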


Second, in the line-search procedure, we require the solution $\lambda_{k,i}$ at Step 3.a, and the evaluation of
$g(\lambda_{k,i})$. The total computational cost depends on the proximal operator of $h$ and the evaluations of
$g$. We prove below that our algorithm requires two oracle queries of $g$ on average.

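Theorem 4.1. The primal sequence $\{\bar{x}_k\}$ generated by Algorithm 1 satisfies

$$-\|\lambda^\star\| \, \mathrm{dist}(A\bar{x}_k - b, \mathcal{K}) \le f(\bar{x}_k) - f^\star \le \frac{\bar{M}_\epsilon \|\lambda_0\|^2}{k + 1} + \frac{\epsilon}{2}, \tag{11}$$

$$\mathrm{dist}(A\bar{x}_k - b, \mathcal{K}) \le \frac{4\bar{M}_\epsilon}{k + 1} \|\lambda_0 - \lambda^\star\| + \sqrt{\frac{2\bar{M}_\epsilon \epsilon}{k + 1}}, \tag{12}$$

where $\bar{M}_\epsilon$ is defined by (10), $\lambda^\star \in \Lambda^\star$ is an arbitrary dual solution, and $\epsilon$ is the desired accuracy.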

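The worst-case analytical complexity: We establish the total number of iterations $k_{\max}$ to achieve
an $\epsilon$-solution $\bar{x}_k$ of (1). The supplementary material proves that

$$k_{\max} = \left\lfloor \left[ \frac{4\sqrt{2} \|\lambda^\star\|}{-1 + \sqrt{1 + 8 \frac{\|\lambda^\star\|}{\|\lambda^\star\|_{[1]}}}} \right]^2 \inf_{0 \le \nu \le 1} \left( \frac{M_\nu}{\epsilon} \right)^{\frac{2}{1 + \nu}} \right\rfloor, \tag{13}$$

where $\|\lambda^\star\|_{[1]} = \max\{\|\lambda^\star\|, 1\}$. This complexity is optimal for $\nu = 0$, but not for $\nu > 0$ [16].

At each iteration $k$, the line-search procedure at Step 3 requires the evaluations of $g$. The
supplementary material bounds the total number $N_1(k)$ of oracle queries, including the function $G$ and its
gradient evaluations, up to the $k$th iteration as follows:

$$N_1(k) \le 2(k + 1) + 1 - \log_2(M_{-1}) + \inf_{0 \le \nu \le 1} \left\{ \frac{1 - \nu}{1 + \nu} \log_2\left( \frac{1 - \nu}{(1 + \nu)\epsilon} \right) + \frac{2}{1 + \nu} \log_2 M_\nu \right\}. \tag{14}$$

Hence, we have $N_1(k) \approx 2(k + 1)$, i.e., we require approximately two oracle queries at each iteration
on average.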


4.2 Accelerated universal primal-dual gradient method 

We now develop an accelerated scheme for solving (5). Our scheme is different from [15] in two
key aspects. First, we adopt the FISTA [10] scheme to obtain the dual sequence since it requires
fewer prox operators compared to the fast scheme in [15]. Second, we perform the line-search after
computing $\nabla g(\hat{\lambda}_k)$, which can reduce the number of the sharp-operator computations of $f$ and $\mathcal{X}$.
Note that the application of FISTA to the dual function is not novel per se. However, we claim that
our theoretical characterization of this classical scheme based on the Hölder continuity assumption
in the composite minimization setting is new.

Algorithm 2 (Accelerated Universal Primal-Dual Gradient Method (AccUniPDGrad))

Initialization: Choose an initial point $\lambda_0 = \hat{\lambda}_0 \in \mathbb{R}^n$ and an accuracy level $\epsilon > 0$.
Estimate a value $M_{-1}$ such that $0 < M_{-1} \le \bar{M}_\epsilon$. Set $S_{-1} = 0$, $t_0 = 1$ and $\bar{x}_{-1} = 0^p$.
for $k = 0$ to $k_{\max}$
  1. Compute a primal solution $x^*(\hat{\lambda}_k) \in [-A^T \hat{\lambda}_k]^\sharp_{\mathcal{X}, f}$.
  2. Form $\nabla g(\hat{\lambda}_k) = b - Ax^*(\hat{\lambda}_k)$.
  3. Line-search: Set $M_{k,0} = M_{k-1}$. For $i = 0$ to $i_{\max}$, perform the following steps:
    3.a. Compute the trial point $\lambda_{k,i} = \mathrm{prox}_{M_{k,i}^{-1} h}\left( \hat{\lambda}_k - M_{k,i}^{-1} \nabla g(\hat{\lambda}_k) \right)$.
    3.b. If the following line-search condition holds:
           $g(\lambda_{k,i}) \le Q_{M_{k,i}}(\lambda_{k,i}; \hat{\lambda}_k) + \epsilon/(2t_k)$,
         then set $i_k = i$ and terminate the line-search loop. Otherwise, set $M_{k,i+1} = 2M_{k,i}$.
     End of line-search
  4. Set $\lambda_{k+1} = \lambda_{k,i_k}$ and $M_k = M_{k,i_k}$. Compute $w_k = t_k/M_k$, $S_k = S_{k-1} + w_k$, and $\gamma_k = w_k/S_k$.
  5. Compute $t_{k+1} = 0.5\left[ 1 + \sqrt{1 + 4t_k^2} \right]$ and update $\hat{\lambda}_{k+1} = \lambda_{k+1} + \frac{t_k - 1}{t_{k+1}} (\lambda_{k+1} - \lambda_k)$.
  6. Compute $\bar{x}_k = (1 - \gamma_k)\bar{x}_{k-1} + \gamma_k x^*(\hat{\lambda}_k)$.
end for
Output: Return the primal approximation $\bar{x}_k$ for $x^\star$.
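
The momentum sequence used in Step 5 can be tabulated with a few lines of code. The sketch below assumes the classical FISTA rule, i.e., $t_0 = 1$, $t_{k+1} = 0.5[1 + \sqrt{1 + 4t_k^2}]$ and extrapolation weight $(t_k - 1)/t_{k+1}$; the function name is hypothetical.

```python
import math


def fista_momentum(num_iters):
    """Return a list of pairs (t_k, beta_k) for k = 0, ..., num_iters - 1.

    t_0 = 1, t_{k+1} = 0.5 * (1 + sqrt(1 + 4 * t_k^2)), and
    beta_k = (t_k - 1) / t_{k+1} is the classical FISTA extrapolation weight.
    """
    out, t = [], 1.0
    for _ in range(num_iters):
        t_next = 0.5 * (1.0 + math.sqrt(1.0 + 4.0 * t * t))
        out.append((t, (t - 1.0) / t_next))
        t = t_next
    return out
```

The sequence grows roughly linearly, and one can check numerically that $t_k \le k + 1$, which is the property the line-search tolerance $\epsilon/(2t_k)$ relies on in the analysis.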


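Complexity per-iteration: The per-iteration complexity of Algorithm 2 remains essentially the same
as that of Algorithm 1.

Theorem 4.2. The primal sequence $\{\bar{x}_k\}$ generated by Algorithm 2 satisfies

$$-\|\lambda^\star\| \, \mathrm{dist}(A\bar{x}_k - b, \mathcal{K}) \le f(\bar{x}_k) - f^\star \le \frac{4\bar{M}_\epsilon \|\lambda_0\|^2}{(k + 2)^{\frac{1 + 3\nu}{1 + \nu}}} + \frac{\epsilon}{2}, \tag{15}$$

$$\mathrm{dist}(A\bar{x}_k - b, \mathcal{K}) \le \frac{16\bar{M}_\epsilon}{(k + 2)^{\frac{1 + 3\nu}{1 + \nu}}} \|\lambda_0 - \lambda^\star\| + \sqrt{\frac{8\bar{M}_\epsilon \epsilon}{(k + 2)^{\frac{1 + 3\nu}{1 + \nu}}}}, \tag{16}$$

where $\bar{M}_\epsilon$ is defined by (10), $\lambda^\star \in \Lambda^\star$ is an arbitrary dual solution, and $\epsilon$ is the desired accuracy.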

The worst-case analytical complexity: The supplementary material proves the following worst-case
complexity of Algorithm 2 to achieve an $\epsilon$-solution $\bar{x}_k$:

$$k_{\max} = \left\lfloor \left[ \frac{8\sqrt{2} \|\lambda^\star\|}{-1 + \sqrt{1 + 8 \frac{\|\lambda^\star\|}{\|\lambda^\star\|_{[1]}}}} \right]^{\frac{2 + 2\nu}{1 + 3\nu}} \inf_{0 \le \nu \le 1} \left( \frac{M_\nu}{\epsilon} \right)^{\frac{2}{1 + 3\nu}} \right\rfloor. \tag{17}$$

This worst-case complexity is optimal in the sense of first-order black box models.

The line-search procedure at Step 3 of Algorithm 2 also terminates after a finite number of iterations.
Similar to Algorithm 1, Algorithm 2 requires 1 gradient query and $i_k$ function evaluations of $g$ at
each iteration. The supplementary material proves that the number of oracle queries in Algorithm 2
is upper bounded as follows:

$$N_2(k) \le 2(k + 1) + 1 + \frac{1 - \nu}{1 + \nu} \left[ \log_2(k + 1) - \log_2(\epsilon) \right] + \frac{2}{1 + \nu} \log_2(M_\nu) - \log_2(M_{-1}). \tag{18}$$

Roughly speaking, Algorithm 2 requires approximately two oracle queries per iteration on average.


5 Numerical experiments 

This section illustrates the scalability and the flexibility of our primal-dual framework using
applications in quantum tomography (QT) and matrix completion (MC).

5.1 Quantum tomography with Pauli operators

We consider the QT problem, which aims to extract information from a physical quantum system. A
$q$-qubit quantum system is mathematically characterized by its density matrix, which is a complex
$p \times p$ positive semidefinite Hermitian matrix $X^\natural \in \mathcal{S}^p_+$, where $p = 2^q$. Surprisingly, we can provably
deduce the state from performing compressive linear measurements $b = \mathcal{A}(X) \in \mathbb{C}^n$ based on Pauli
operators $\mathcal{A}$. While the size of the density matrix grows exponentially in $q$, significantly fewer
compressive measurements (i.e., $n = O(p \log p)$) suffice to recover a pure state $q$-qubit density
matrix as a result of the following convex optimization problem:

$$\varphi^\star = \min_{X \in \mathcal{S}^p_+} \left\{ \varphi(X) := \frac{1}{2} \|\mathcal{A}(X) - b\|_2^2 : \mathrm{tr}(X) = 1 \right\}, \quad (X^\star : \varphi(X^\star) = \varphi^\star), \tag{19}$$

where the constraint ensures that $X^\star$ is a density matrix. The recovery is also robust to noise.

Since the objective function has Lipschitz gradient and the constraint (i.e., the Spectrahedron) is 
tuning-free, the QT problem provides an ideal scalability test for both our framework and GCG-type 
algorithms. To verify the performance of the algorithms with respect to the optimal solution in large- 
scale, we remain within the noiseless setting. However, the timing and the convergence behavior of 
the algorithms remain qualitatively the same under polarization and additive Gaussian noise. 




Figure 1: The convergence behavior of algorithms for the q = 14 qubits QT problem. The solid lines 
correspond to the theoretical weighting scheme, and the dashed lines correspond to the line-search 
(in the weighting step) variants. 

To this end, we generate a random pure quantum state (e.g., rank-1 $X^\natural$), and we take $n = 2p \log p$
random Pauli measurements. For a $q = 14$ qubit system, this corresponds to a 268'435'456-dimensional
problem with $n = 138'099$ measurements. We recast (19) into (1) by introducing the slack
variable $r = \mathcal{A}(X) - b$.

We compare our algorithms vs. the Frank-Wolfe method, which has optimal convergence rate
guarantees for this problem, and its line-search variant. Computing the sharp-operator $[x]^\sharp$ requires a
top-eigenvector $e_1$ of $\mathcal{A}^*(\lambda)$, while evaluating $g$ corresponds to just computing the top-eigenvalue
$\sigma_1$ of $\mathcal{A}^*(\lambda)$ via a power method. All methods use the same power method subroutine, which is
implemented in MATLAB's eigs function. We set $\epsilon = 2 \times 10^{-4}$ for our methods and use a
wall-time limit of $2 \times 10^4$ s in order to stop the algorithms. However, our algorithms seem insensitive to the
choice of $\epsilon$ for the QT problem.
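
For readers unfamiliar with the subroutine, a plain power iteration can be sketched as follows. This is a generic textbook version written for small dense symmetric matrices stored as nested lists, not the MATLAB routine used in the experiments; it returns the eigenvalue of largest magnitude, which coincides with the top eigenvalue only when that eigenvalue is dominant (for example, for positive semidefinite matrices). The deterministic starting vector is an assumption of this illustration.

```python
def power_method(mat, iters=200):
    """Estimate the dominant eigenpair of a symmetric matrix by power iteration."""
    n = len(mat)
    v = [1.0 / n ** 0.5] * n                 # deterministic unit starting vector
    for _ in range(iters):
        w = [sum(mat[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:                      # v lies in the null space; stop
            break
        v = [x / norm for x in w]
    mv = [sum(mat[i][j] * v[j] for j in range(n)) for i in range(n)]
    eigenvalue = sum(v[i] * mv[i] for i in range(n))   # Rayleigh quotient
    return eigenvalue, v
```

Each iteration needs only one matrix-vector product, which is why the same subroutine can serve both the sharp-operator (eigenvector) and the evaluation of $g$ (eigenvalue).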

Figure 1 illustrates the iteration and the timing complexities of the algorithms. The UniPDGrad
algorithm, with an average of 1.978 line-search steps per iteration, has similar iteration and timing
performance as compared to the standard Frank-Wolfe scheme with step-size $\gamma_k = 2/(k + 2)$. The
line-search variant of Frank-Wolfe improves over the standard one; however, our accelerated variant,
with an average of 1.057 line-search steps, is the clear winner in terms of both iterations and time.
We can empirically improve the performance of our algorithms even further by adapting a similar
line-search strategy in the weighting step as Frank-Wolfe, i.e., by choosing the weights $w_k$ in a
greedy fashion to minimize the objective function. The practical improvements due to line-search
appear quite significant.

5.2 Matrix completion with MovieLens dataset 

To demonstrate the flexibility of our framework, we consider the popular matrix completion (MC)
application. In MC, we seek to estimate a low-rank matrix $X \in \mathbb{R}^{p \times l}$ from its subsampled entries
$b \in \mathbb{R}^n$, where $\mathcal{A}(\cdot)$ is the sampling operator, i.e., $\mathcal{A}(X) = b$.

Figure 2: The performance of the algorithms for the MC problems. The dashed lines correspond to
the line-search (in the weighting step) variants, and the empty and the filled markers correspond to
the formulations (20) and (21), respectively.


Convex formulations involving the nuclear norm have been shown to be quite effective in estimating
low-rank matrices from a limited number of measurements. For instance, we can solve

$$\min_{X \in \mathbb{R}^{p \times l}} \left\{ \varphi(X) = \frac{1}{n} \|\mathcal{A}(X) - b\|^2 : \|X\|_* \le \kappa \right\}, \tag{20}$$

with Frank-Wolfe-type methods, where $\kappa$ is a tuning parameter, which may not be available a priori.
We can also solve the following parameter-free version:

$$\min_{X \in \mathbb{R}^{p \times l}} \left\{ \psi(X) = \frac{1}{n} \|X\|_*^2 : \mathcal{A}(X) = b \right\}. \tag{21}$$

While the nonsmooth objective of (21) avoids the tuning parameter, it clearly burdens the
computational efficiency of the convex optimization algorithms.

We apply our algorithms to (20) and (21) using the MovieLens 100K dataset. Frank-Wolfe
algorithms cannot handle (21) and only solve (20). For this experiment, we did not pre-process the data
and took the default ub test and training data partition. We start our algorithms from $\lambda_0 = 0^n$, we
set the target accuracy $\epsilon = 10^{-3}$, and we choose the tuning parameter $\kappa = 9975/2$ as in [20]. We
use the lansvd function (MATLAB version) from PROPACK to compute the top singular vectors,
and a simple implementation of the power method to find the top singular value in the line-search,
both with $10^{-5}$ relative error tolerance.

The first two plots in Figure 2 show the performance of the algorithms for (20). Our metrics are
the normalized objective residual and the root mean squared error (RMSE) calculated for the test
data. Since we do not have access to the optimal solutions, we approximated the optimal values,
$\varphi^\star$ and RMSE$^\star$, by 5000 iterations of AccUniPDGrad. The other two plots in Figure 2 compare the
performance of the formulations (20) and (21), which are represented by the empty and the filled
markers, respectively. Note that the dashed line for AccUniPDGrad corresponds to the line-search
variant, where the weights $w_k$ are chosen to minimize the feasibility gap. Additional details about
the numerical experiments can be found in the supplementary material.
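
For reference, the RMSE metric on held-out entries can be computed as in the following sketch. It uses the standard definition and assumes the predicted and observed test ratings are given as two equally long sequences; the function name is hypothetical, and the normalization used in the actual experiments may differ.

```python
import math


def rmse(predicted, observed):
    """Root mean squared error between predicted and observed test entries."""
    if len(predicted) != len(observed) or len(observed) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    return math.sqrt(total / len(observed))
```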

6 Conclusions 

This paper proposes a new primal-dual algorithmic framework that combines the flexibility of
proximal primal-dual methods in addressing the general template (1) while leveraging the computational
advantages of the GCG-type methods. The algorithmic instances of our framework are universal
since they can automatically adapt to the unknown Hölder continuity properties implied by the
template. Our analysis technique unifies Nesterov's universal gradient methods and GCG-type methods
to address the more broadly applicable primal-dual setting. The hallmarks of our approach include
the optimal worst-case complexity and its flexibility to handle nonsmooth objectives and complex
constraints, compared to existing primal-dual algorithms as well as GCG-type algorithms, while
essentially preserving their low cost iteration complexity.

Supplementary document:

A Universal Primal-Dual Convex Optimization Framework

In this supplementary document, we provide the technical proofs and additional implementation
details. It is organized as follows: Section A defines the key estimates, which form the basis of
the universal gradient algorithms. Sections B and C present the proofs of Theorems 4.1 and 4.2,
respectively. Finally, Section D provides the implementation details of the quantum tomography
and the matrix completion problems considered in Section 5.

A The key estimate of the proximal-gradient step

Lemma 2 in [1], which we present below as Lemma A.1, provides key properties for constructing
universal gradient algorithms. We refer to [1] for the proof of this lemma.

Lemma A.1. Let the function $g$ satisfy Assumption A.2. Then, for any $\delta > 0$ and

$$M \ge \left[ \frac{1 - \nu}{1 + \nu} \frac{1}{\delta} \right]^{\frac{1 - \nu}{1 + \nu}} M_\nu^{\frac{2}{1 + \nu}},$$

the following statement holds for any $\lambda, \tilde{\lambda} \in \mathbb{R}^n$:

$$g(\lambda) \le \underbrace{g(\tilde{\lambda}) + \langle \nabla g(\tilde{\lambda}), \lambda - \tilde{\lambda} \rangle + \frac{M}{2} \|\lambda - \tilde{\lambda}\|^2}_{Q_M(\lambda; \tilde{\lambda})} + \frac{\delta}{2}.$$

This lemma provides an approximate quadratic upper bound for $g$. However, it depends on the
choice of the inexactness parameter $\delta$ and the smoothness parameter $\nu$. If $\nu = 1$, then $M$ can be set
to the Lipschitz constant $M_1$, and it becomes independent of $\delta$.

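The algorithms that we develop in this paper are based on the proximal-gradient step (9) on the dual
objective function $G$. This update rule guarantees the following estimate:

Lemma A.2. Let $Q_M$ be the quadratic model of $g$. If $\lambda_{k+1}$, which is defined by (9), satisfies

$$g(\lambda_{k+1}) \le Q_{M_k}(\lambda_{k+1}; \hat{\lambda}_k) + \frac{\delta_k}{2} \tag{22}$$

for some $\delta_k \in \mathbb{R}$, then the following inequality holds for any $\lambda \in \mathbb{R}^n$:

$$G(\lambda_{k+1}) \le g(\hat{\lambda}_k) + \langle \nabla g(\hat{\lambda}_k), \lambda - \hat{\lambda}_k \rangle + h(\lambda) + \frac{\delta_k}{2} + \frac{M_k}{2} \left[ \|\lambda - \hat{\lambda}_k\|^2 - \|\lambda - \lambda_{k+1}\|^2 \right].$$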

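Proof of Lemma A.2. We note that the optimality condition of (9) is

$$0 \in \nabla g(\hat{\lambda}_k) + M_k(\lambda_{k+1} - \hat{\lambda}_k) + \partial h(\lambda_{k+1}),$$

which can be written as $\hat{\lambda}_k - \lambda_{k+1} \in M_k^{-1}\left( \nabla g(\hat{\lambda}_k) + \partial h(\lambda_{k+1}) \right)$. Let $\nabla h(\lambda_{k+1}) \in \partial h(\lambda_{k+1})$ be
a subgradient of $h$ at $\lambda_{k+1}$. Then, we have

$$\hat{\lambda}_k - \lambda_{k+1} = \frac{1}{M_k} \left( \nabla g(\hat{\lambda}_k) + \nabla h(\lambda_{k+1}) \right). \tag{23}$$

Now, using (23), we can derive

$$\begin{aligned}
r_k(\lambda) &:= \frac{1}{2} \|\lambda - \lambda_{k+1}\|^2 - \frac{1}{2} \|\lambda - \hat{\lambda}_k\|^2 \\
&= \langle \hat{\lambda}_k - \lambda_{k+1}, \lambda - \lambda_{k+1} \rangle - \frac{1}{2} \|\hat{\lambda}_k - \lambda_{k+1}\|^2 \\
&\overset{(23)}{=} \frac{1}{M_k} \langle \nabla g(\hat{\lambda}_k) + \nabla h(\lambda_{k+1}), \lambda - \lambda_{k+1} \rangle - \frac{1}{2} \|\hat{\lambda}_k - \lambda_{k+1}\|^2 \\
&= -\frac{1}{M_k} \left[ \langle \nabla g(\hat{\lambda}_k), \lambda_{k+1} - \hat{\lambda}_k \rangle + \frac{M_k}{2} \|\lambda_{k+1} - \hat{\lambda}_k\|^2 \right] + \frac{1}{M_k} \langle \nabla h(\lambda_{k+1}), \lambda - \lambda_{k+1} \rangle + \frac{1}{M_k} \langle \nabla g(\hat{\lambda}_k), \lambda - \hat{\lambda}_k \rangle \\
&\overset{(22)}{\le} \frac{1}{M_k} \left[ g(\hat{\lambda}_k) - g(\lambda_{k+1}) + \frac{\delta_k}{2} \right] + \frac{1}{M_k} \langle \nabla h(\lambda_{k+1}), \lambda - \lambda_{k+1} \rangle + \frac{1}{M_k} \langle \nabla g(\hat{\lambda}_k), \lambda - \hat{\lambda}_k \rangle \\
&\le \frac{1}{M_k} \left[ g(\hat{\lambda}_k) - g(\lambda_{k+1}) + \frac{\delta_k}{2} \right] + \frac{1}{M_k} \left[ h(\lambda) - h(\lambda_{k+1}) \right] + \frac{1}{M_k} \langle \nabla g(\hat{\lambda}_k), \lambda - \hat{\lambda}_k \rangle \\
&= \frac{1}{M_k} \left[ g(\hat{\lambda}_k) + \langle \nabla g(\hat{\lambda}_k), \lambda - \hat{\lambda}_k \rangle + h(\lambda) + \frac{\delta_k}{2} \right] - \frac{1}{M_k} G(\lambda_{k+1}),
\end{aligned}$$

where the last inequality directly follows from the convexity of $h$. Multiplying both sides by $M_k$ and
rearranging gives the claimed inequality. □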
Clearly, (22) holds if $M_k \ge \bar{M}_\epsilon$, which is defined by (10), due to Lemma A.1 whenever $\delta_k = \epsilon > 0$.

If $\nu$ and $M_\nu$ are known, we can set $M_k = \bar{M}_\epsilon$; then the condition (22) is automatically satisfied.
However, we do not know $\nu$ and $M_\nu$ a priori in general. In this case, $M_k$ can be determined via a
line-search procedure on the condition (22).


The following lemma guarantees that the line-search procedure in Algorithms 1 and 2 terminates
after a finite number of line-search iterations.

Lemma A.3. The line-search procedure in Algorithm 1 terminates after at most

$$i_k = \left\lfloor \log_2\left( \bar{M}_\epsilon / M_{-1} \right) \right\rfloor + 1$$

iterations. Similarly, the line-search procedure in Algorithm 2 terminates after at most

$$i_k = \left\lfloor \frac{1 - \nu}{1 + \nu} \log_2\left( \frac{k + 1}{\epsilon} \right) + \log_2\left( \frac{M_\nu^{\frac{2}{1 + \nu}}}{M_{-1}} \right) \right\rfloor + 1$$

iterations.
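
The first bound of Lemma A.3 is easy to evaluate numerically. The following sketch computes $\lfloor \log_2(\bar{M}_\epsilon / M_{-1}) \rfloor + 1$ for Algorithm 1; the function name and argument names are hypothetical, and the inputs are assumed to satisfy $0 < M_{-1} \le \bar{M}_\epsilon$ as in the initialization of the algorithm.

```python
import math


def line_search_bound_alg1(m_bar_eps, m_init):
    """Evaluate floor(log2(m_bar_eps / m_init)) + 1 for 0 < m_init <= m_bar_eps."""
    if not 0.0 < m_init <= m_bar_eps:
        raise ValueError("need 0 < m_init <= m_bar_eps")
    return math.floor(math.log2(m_bar_eps / m_init)) + 1
```

Because the bound is logarithmic in the ratio, even a very conservative initial estimate $M_{-1}$ costs only a few extra doublings.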


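Proof. Under Assumption A.2, $M_\nu$ defined in Lemma A.1 is finite. When $\delta_k = \epsilon > 0$ is fixed as in
Algorithm 1, the upper bound $\bar{M}_\epsilon = \left[ \frac{1 - \nu}{1 + \nu} \frac{1}{\epsilon} \right]^{\frac{1 - \nu}{1 + \nu}} M_\nu^{\frac{2}{1 + \nu}}$ defined by (10) is also finite. Moreover,
the condition (22) holds whenever $M_{k,i} \ge \bar{M}_\epsilon$. Since $M_{k,i} = 2M_{k,i-1} = 2^i M_{k,0} \ge 2^i M_{-1}$, the
line-search procedure is terminated after at most $i_k = \lfloor \log_2(\bar{M}_\epsilon / M_{-1}) \rfloor + 1$ iterations.

Now, we show that the line-search procedure in Algorithm 2 is also finite. By the updating rule of
$t_k$, we have $t_{k+1} := 0.5\left( 1 + \sqrt{1 + 4t_k^2} \right) \le 0.5(1 + (1 + 2t_k)) = t_k + 1$. By induction and $t_0 = 1$,
we have $t_k \le k + 1$. Using the definition (10) of $\bar{M}_{\delta_k}$ with $\delta_k = \epsilon/t_k$ and $t_k \le k + 1$, we can show
that

$$\bar{M}_{\delta_k} = \left[ \frac{1 - \nu}{1 + \nu} \frac{1}{\delta_k} \right]^{\frac{1 - \nu}{1 + \nu}} M_\nu^{\frac{2}{1 + \nu}} = \left[ \frac{1 - \nu}{1 + \nu} \frac{t_k}{\epsilon} \right]^{\frac{1 - \nu}{1 + \nu}} M_\nu^{\frac{2}{1 + \nu}} \le \left[ \frac{k + 1}{\epsilon} \right]^{\frac{1 - \nu}{1 + \nu}} M_\nu^{\frac{2}{1 + \nu}}. \tag{24}$$

Next, we note that the condition (22) holds whenever $M_{k,i} \ge \bar{M}_{\delta_k}$. However, since
$M_{k,i} = 2^i M_{k,0} \ge 2^i M_{-1}$, by using (24), it is sufficient to show that the following condition holds for a
finite $i$:

$$2^i M_{-1} \ge \left[ \frac{k + 1}{\epsilon} \right]^{\frac{1 - \nu}{1 + \nu}} M_\nu^{\frac{2}{1 + \nu}}.$$

This condition leads to $i \ge \log_2\left( \left[ \frac{k + 1}{\epsilon} \right]^{\frac{1 - \nu}{1 + \nu}} M_\nu^{\frac{2}{1 + \nu}} \right) - \log_2(M_{-1})$. Hence, at the $k$th iteration, we
require at most $i_k = \left\lfloor \frac{1 - \nu}{1 + \nu} \log_2\left( \frac{k + 1}{\epsilon} \right) + \log_2\left( M_\nu^{\frac{2}{1 + \nu}} / M_{-1} \right) \right\rfloor + 1$ line-search iterations, which is finite. □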


B Convergence analysis of the universal primal-dual gradient algorithm 

In this section, we analyze the convergence of Algorithm 1 (UniPDGrad). We first provide the
convergence guarantee of the dual function in Theorem B.1. Then, we prove the convergence rate
and the worst-case complexity given in Theorem 4.1.


B.1 Convergence rate of the dual objective function


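Theorem B.1. Let $\{\lambda_k\}$ be the sequence generated by UniPDGrad. Then,

$$G(\bar{\lambda}_k) - G(\lambda) \le \bar{G}_k - G(\lambda) \le \frac{M_\epsilon}{k+1}\|\lambda_0 - \lambda\|^2 + \frac{\epsilon}{2} \tag{25}$$

for any $\lambda \in \mathbb{R}^n$, where $M_\epsilon$ is defined by (10) and the two averaging sequences $\{\bar{\lambda}_k\}$ and $\{\bar{G}_k\}$ are
defined as follows:

$$\bar{\lambda}_k := \frac{1}{S_k}\sum_{i=0}^{k}\frac{1}{M_i}\lambda_{i+1} \quad \text{and} \quad \bar{G}_k := \frac{1}{S_k}\sum_{i=0}^{k}\frac{1}{M_i}G(\lambda_{i+1}), \quad \text{where } S_k := \sum_{i=0}^{k}\frac{1}{M_i}.$$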


Proof. For $M_\epsilon$ defined by (10), since the line-search is successful as shown in Lemma A.1, the
condition (22) is satisfied at iteration $i$ with $M_i \le 2M_\epsilon$. The following inequality directly follows from
Lemma A.2 considering the convexity of $g$:

$$G(\lambda_{i+1}) \le G(\lambda) + \frac{\epsilon}{2} + \frac{M_i}{2}\left[\|\lambda - \lambda_i\|^2 - \|\lambda - \lambda_{i+1}\|^2\right], \quad \forall \lambda \in \mathbb{R}^n.$$

Taking the weighted sum of this inequality over $i$, we get

$$\bar{G}_k \le G(\lambda) + \frac{\epsilon}{2} + \frac{1}{2S_k}\left[\|\lambda - \lambda_0\|^2 - \|\lambda - \lambda_{k+1}\|^2\right] \tag{26}$$

for any $\lambda \in \mathbb{R}^n$, and $G(\bar{\lambda}_k) \le \bar{G}_k$ since $G$ is a convex function. Finally, since $M_i \le 2M_\epsilon$, we have
$S_k \ge \frac{k+1}{2M_\epsilon}$. Substituting this estimate into (26), we obtain (25). □


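B.2 The proof of Theorem 4.1: Convergence rate of the primal sequence

Proof. We use the following three expressions to relate the convergence in the dual sequence to the
convergence in the primal sequence:

$$\begin{aligned} g(\lambda_i) &= -f(x^\star(\lambda_i)) + \langle \lambda_i, b - Ax^\star(\lambda_i) \rangle, \\ \nabla g(\lambda_i) &= b - Ax^\star(\lambda_i), \\ G(\lambda_{i+1}) &\ge G^\star = -d^\star = -f^\star. \end{aligned} \tag{27}$$

Substituting these expressions into Lemma A.2, we get the following key estimate in the primal
space that holds for any $\lambda \in \mathbb{R}^n$:

$$f(x^\star(\lambda_i)) - f^\star \le \langle b - Ax^\star(\lambda_i), \lambda \rangle + h(\lambda) + \frac{\epsilon}{2} + \frac{M_i}{2}\left[\|\lambda - \lambda_i\|^2 - \|\lambda - \lambda_{i+1}\|^2\right].$$

Taking the weighted sum of this inequality over $i$ and considering the convexity of $f$, we get

$$f(\bar{x}_k) - f^\star \le \langle b - A\bar{x}_k, \lambda \rangle + h(\lambda) + \frac{\epsilon}{2} + \frac{1}{2S_k}\left[\|\lambda - \lambda_0\|^2 - \|\lambda - \lambda_{k+1}\|^2\right]. \tag{28}$$

Setting $\lambda = 0^n$, we get the bound on the right hand side of (11),

$$f(\bar{x}_k) - f^\star \le \frac{\epsilon}{2} + \frac{1}{2S_k}\|\lambda_0\|^2 \le \frac{\epsilon}{2} + \frac{M_\epsilon\|\lambda_0\|^2}{k+1}.$$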


The inequality on the left hand side of (11) follows from the following saddle point formulation:

$$f^\star \le \mathcal{L}(x, r, \lambda^\star) = f(x) + \langle \lambda^\star, Ax - b - r \rangle \le f(x) + \|\lambda^\star\|\|Ax - b - r\|, \tag{29}$$

$\forall r \in \mathcal{K}$, and $\forall x \in \mathcal{X}$, where the last inequality holds due to the Cauchy-Schwarz inequality. The proof
of the convergence rate in the objective residual (11) follows by setting $x = \bar{x}_k$ in (29).

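Next, we prove the convergence rate of the feasibility gap. We start from the following saddle
point formulation:

$$f^\star \le \mathcal{L}(x, r, \lambda^\star) = f(x) + \langle \lambda^\star, Ax - b - r \rangle, \quad \forall r \in \mathcal{K}, \ \forall x \in \mathcal{X}.$$

Substituting this estimate with $x = \bar{x}_k$ into (28), we get the following inequality:

$$\begin{aligned} & \langle A\bar{x}_k - b - r^\star(\lambda), \lambda - \lambda^\star \rangle - \frac{1}{2S_k}\left[\|\lambda - \lambda_0\|^2 - \|\lambda - \lambda_{k+1}\|^2\right] \le \frac{\epsilon}{2} \\ \Longrightarrow \ & \min_{r \in \mathcal{K}}\left\{\langle A\bar{x}_k - b - r, \lambda - \lambda^\star \rangle - \frac{1}{2S_k}\|\lambda - \lambda_0\|^2\right\} \le \frac{\epsilon}{2} \\ \Longrightarrow \ & \max_{\lambda \in \mathbb{R}^n}\min_{r \in \mathcal{K}}\left\{\langle A\bar{x}_k - b - r, \lambda - \lambda^\star \rangle - \frac{1}{2S_k}\|\lambda - \lambda_0\|^2\right\} \le \frac{\epsilon}{2} \\ \Longrightarrow \ & \min_{r \in \mathcal{K}}\max_{\lambda \in \mathbb{R}^n}\left\{\langle A\bar{x}_k - b - r, \lambda - \lambda^\star \rangle - \frac{1}{2S_k}\|\lambda - \lambda_0\|^2\right\} \le \frac{\epsilon}{2} \\ \Longrightarrow \ & \min_{r \in \mathcal{K}}\left\{\langle A\bar{x}_k - b - r, \lambda_0 - \lambda^\star \rangle + \frac{S_k}{2}\|A\bar{x}_k - b - r\|^2\right\} \le \frac{\epsilon}{2} \end{aligned}$$

for any $\lambda \in \mathbb{R}^n$, where $r^\star(\lambda) := \arg\max_{r \in \mathcal{K}}\langle r, \lambda \rangle$, and the third implication holds due to
Sion's minimax theorem. Hence, there exists a vector $\bar{r} \in \mathcal{K}$ that satisfies the following inequality:

$$\langle A\bar{x}_k - b - \bar{r}, \lambda_0 - \lambda^\star \rangle + \frac{S_k}{2}\|A\bar{x}_k - b - \bar{r}\|^2 \le \frac{\epsilon}{2}.$$

Using the Cauchy-Schwarz inequality, this implies

$$-\|A\bar{x}_k - b - \bar{r}\|\|\lambda_0 - \lambda^\star\| + \frac{S_k}{2}\|A\bar{x}_k - b - \bar{r}\|^2 \le \frac{\epsilon}{2}.$$

Solving this inequality for $\|A\bar{x}_k - b - \bar{r}\|$, we get

$$\mathrm{dist}(A\bar{x}_k - b, \mathcal{K}) \le \|A\bar{x}_k - b - \bar{r}\| \le \frac{1}{S_k}\left[\|\lambda_0 - \lambda^\star\| + \sqrt{\|\lambda_0 - \lambda^\star\|^2 + S_k\epsilon}\right] \le \frac{1}{S_k}\left[2\|\lambda_0 - \lambda^\star\| + \sqrt{S_k\epsilon}\right].$$

We note that $S_k \ge \frac{k+1}{2M_\epsilon}$, and this completes the proof. □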
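
The last implication in the chain above rests on a closed-form maximization over $\lambda$. As a worked step (the shorthand $u$ is introduced here only for illustration and does not appear in the original):

```latex
% Shorthand (not in the original): u := A\bar{x}_k - b - r for a fixed r \in \mathcal{K}.
% The objective is a concave quadratic in \lambda, so the maximizer solves
% u - (1/S_k)(\lambda - \lambda_0) = 0, i.e. \lambda = \lambda_0 + S_k u.
\max_{\lambda \in \mathbb{R}^n}
  \Big\{ \langle u, \lambda - \lambda^\star \rangle - \tfrac{1}{2 S_k}\|\lambda - \lambda_0\|^2 \Big\}
  = \langle u, \lambda_0 - \lambda^\star \rangle + S_k \|u\|^2 - \tfrac{S_k}{2}\|u\|^2
  = \langle u, \lambda_0 - \lambda^\star \rangle + \tfrac{S_k}{2}\|u\|^2 .
```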


B.2.1 The worst-case complexity analysis 

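For simplicity, we choose $\lambda_0 = 0^n$ without loss of generality. Then, in order to guarantee both
$\mathrm{dist}(A\bar{x}_k - b, \mathcal{K}) \le \epsilon$ and $|f(\bar{x}_k) - f^\star| \le \epsilon$, we require

$$\frac{4M_\epsilon}{k+1}\|\lambda^\star\| + \sqrt{\frac{2M_\epsilon\epsilon}{k+1}} \le \frac{\epsilon}{\|\lambda^\star\|_{[1]}}$$

due to Theorem 4.1, where $\|\lambda^\star\|_{[1]} := \max\{\|\lambda^\star\|, 1\}$. This leads to (13) as

$$k + 1 \ge \left[\frac{4\sqrt{2}\|\lambda^\star\|}{-1 + \sqrt{1 + 8\frac{\|\lambda^\star\|}{\|\lambda^\star\|_{[1]}}}}\right]^2 \frac{M_\epsilon}{\epsilon} \quad \Longrightarrow \quad k_{\max} = \left\lfloor \left[\frac{4\sqrt{2}\|\lambda^\star\|}{-1 + \sqrt{1 + 8\frac{\|\lambda^\star\|}{\|\lambda^\star\|_{[1]}}}}\right]^2 \inf_{0 \le \nu \le 1}\left(\frac{M_\nu}{\epsilon}\right)^{\frac{2}{1+\nu}} \right\rfloor.$$

Hence, the worst-case complexity to obtain an $\epsilon$-solution of (1) in the sense of Definition 1.1 is

$$\mathcal{O}\left(\inf_{0 \le \nu \le 1}\left(\frac{M_\nu}{\epsilon}\right)^{\frac{2}{1+\nu}}\right),$$

which is optimal if $\nu = 0$.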

Next, we estimate the total number of oracle queries in UniPDGrad. The total number of
oracle queries up to the iteration $k$ is given by $N_1(k) = \sum_{j=0}^{k}(i_j + 1)$. However, since $i_j - 1 =
\log_2(M_j / M_{j-1})$, we have

$$N_1(k) = \sum_{j=0}^{k}(i_j + 1) = 2(k + 1) + \log_2(M_k) - \log_2(M_{-1}).$$

It remains to use $M_k \le 2M_\epsilon$ to obtain (14).
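
The telescoping behind this count can be replayed in a few lines. This is only a sketch under an assumption: each line search is taken to start from half of the previous constant and to double it $i_j$ times, which is the update consistent with the relation $i_j - 1 = \log_2(M_j / M_{j-1})$ used above. The function names are hypothetical.

```python
import math

def count_oracle_calls(doublings, M_init):
    """Replay line-search counts; return (N_1(k), M_k).

    doublings[j] plays the role of i_j; M_init plays the role of M_{-1}.
    """
    M_prev, total = M_init, 0
    for i_j in doublings:
        M_j = (2.0 ** i_j) * (0.5 * M_prev)  # start at M_{j-1}/2, double i_j times
        total += i_j + 1                     # i_j + 1 oracle calls at iteration j
        M_prev = M_j
    return total, M_prev

def closed_form(k, M_k, M_init):
    """Closed form 2(k + 1) + log2(M_k) - log2(M_{-1})."""
    return 2 * (k + 1) + math.log2(M_k) - math.log2(M_init)

total, M_k = count_oracle_calls([1, 3, 0, 2, 1], M_init=1.0)
print(total, closed_form(4, M_k, 1.0))
```

The two printed numbers agree because the per-iteration terms $\log_2(M_j / M_{j-1})$ telescope.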

C Convergence analysis of the accelerated universal primal-dual algorithm 

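We now analyze the convergence of AccUniPDGrad (Algorithm 2) in terms of the objective residual
and the feasibility gap.

The dual main step of our algorithm is to update $\tilde{\lambda}_{k+1}$ and $\bar{\lambda}_{k+1}$ from $\tilde{\lambda}_k$ and $\bar{\lambda}_k$ as follows:

$$\begin{cases} \hat{\lambda}_k := (1 - \tau_k)\bar{\lambda}_k + \tau_k\tilde{\lambda}_k \\ \bar{\lambda}_{k+1} := \mathrm{prox}_{M_k^{-1}h}\left(\hat{\lambda}_k - M_k^{-1}\nabla g(\hat{\lambda}_k)\right) \\ \tilde{\lambda}_{k+1} := \tilde{\lambda}_k - \frac{1}{\tau_k}\left(\hat{\lambda}_k - \bar{\lambda}_{k+1}\right), \end{cases} \tag{30}$$

where $\tilde{\lambda}_0 = \bar{\lambda}_0$, $\tau_0 = 1$ and

$$\tau_k^2 = \tau_{k-1}^2(1 - \tau_k). \tag{31}$$

The parameter $M_k$ is determined based on the following line-search condition:

$$g(\bar{\lambda}_{k+1}) \le g(\hat{\lambda}_k) + \langle \nabla g(\hat{\lambda}_k), \bar{\lambda}_{k+1} - \hat{\lambda}_k \rangle + \frac{M_k}{2}\|\bar{\lambda}_{k+1} - \hat{\lambda}_k\|^2 + \frac{\epsilon}{2t_k} \tag{32}$$

with $M_k \ge M_{k-1}$.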

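Next, we simplify the scheme (30) in the following lemma:

Lemma C.1. The scheme (30) can be restated as follows:

$$\begin{cases} \bar{\lambda}_{k+1} := \mathrm{prox}_{M_k^{-1}h}\left(\hat{\lambda}_k - M_k^{-1}\nabla g(\hat{\lambda}_k)\right) \\ t_{k+1} := \frac{1}{2}\left[1 + \sqrt{1 + 4t_k^2}\right] \\ \hat{\lambda}_{k+1} := \bar{\lambda}_{k+1} + \frac{t_k - 1}{t_{k+1}}\left(\bar{\lambda}_{k+1} - \bar{\lambda}_k\right), \end{cases} \tag{33}$$

where $\hat{\lambda}_0 = \bar{\lambda}_0$ and $t_0 = 1$, and $M_k$ is determined based on the line-search condition (32).

This dual scheme is of the FISTA form, except for the line-search step.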
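
The equivalence claimed in Lemma C.1 can be checked numerically on a toy problem. The sketch below is an illustration under simplifying assumptions that are not part of the original: the dual variable is a scalar, $h = 0$ (so the prox step is the identity), and $M_k$ is held fixed instead of being found by the line search (32). The function names are hypothetical.

```python
import math

def scheme_30(grad, lam0, M, iters):
    """Three-sequence form (30) with tau_k driven by recursion (31)."""
    lam_bar = lam_tilde = lam0
    tau, out = 1.0, []
    for _ in range(iters):
        lam_hat = (1.0 - tau) * lam_bar + tau * lam_tilde
        lam_bar_next = lam_hat - grad(lam_hat) / M  # prox is the identity when h = 0
        lam_tilde = lam_tilde - (lam_hat - lam_bar_next) / tau
        lam_bar = lam_bar_next
        out.append(lam_bar)
        # positive root of tau_next^2 = tau^2 * (1 - tau_next)
        tau = 0.5 * tau * (math.sqrt(tau * tau + 4.0) - tau)
    return out

def scheme_33(grad, lam0, M, iters):
    """FISTA-type form (33) with t_k = 1 / tau_k."""
    lam_bar = lam_hat = lam0
    t, out = 1.0, []
    for _ in range(iters):
        lam_bar_next = lam_hat - grad(lam_hat) / M
        t_next = 0.5 * (1.0 + math.sqrt(1.0 + 4.0 * t * t))
        lam_hat = lam_bar_next + (t - 1.0) / t_next * (lam_bar_next - lam_bar)
        lam_bar, t = lam_bar_next, t_next
        out.append(lam_bar)
    return out

grad = lambda lam: 3.0 * lam - 1.0  # gradient of the toy function 1.5*lam^2 - lam
a = scheme_30(grad, 2.0, 5.0, 30)
b = scheme_33(grad, 2.0, 5.0, 30)
print(max(abs(x - y) for x, y in zip(a, b)))
```

The printed gap is at the level of floating-point rounding, as the lemma predicts for the two formulations.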

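Proof. Let $t_k = \tau_k^{-1}$, then $t_0 = \tau_0^{-1} = 1$. From (30), we have $\tilde{\lambda}_k - \tilde{\lambda}_{k+1} = \frac{1}{\tau_k}(\hat{\lambda}_k - \bar{\lambda}_{k+1}) =
t_k(\hat{\lambda}_k - \bar{\lambda}_{k+1})$. We also have $\hat{\lambda}_k = (1 - \tau_k)\bar{\lambda}_k + \tau_k\tilde{\lambda}_k$, which leads to $\tilde{\lambda}_k = \frac{1}{\tau_k}[\hat{\lambda}_k - (1 - \tau_k)\bar{\lambda}_k] =
t_k[\hat{\lambda}_k - (1 - t_k^{-1})\bar{\lambda}_k]$. Combining these expressions, we get

$$t_k(\hat{\lambda}_k - \bar{\lambda}_{k+1}) = \tilde{\lambda}_k - \tilde{\lambda}_{k+1} = t_k[\hat{\lambda}_k - (1 - t_k^{-1})\bar{\lambda}_k] - t_{k+1}[\hat{\lambda}_{k+1} - (1 - t_{k+1}^{-1})\bar{\lambda}_{k+1}],$$

and this can be simplified as follows:

$$\begin{aligned} t_{k+1}\hat{\lambda}_{k+1} &= t_k\bar{\lambda}_{k+1} + t_{k+1}(1 - t_{k+1}^{-1})\bar{\lambda}_{k+1} - t_k(1 - t_k^{-1})\bar{\lambda}_k \\ &= (t_k + t_{k+1} - 1)\bar{\lambda}_{k+1} - (t_k - 1)\bar{\lambda}_k. \end{aligned}$$

Hence $\hat{\lambda}_{k+1} = \bar{\lambda}_{k+1} + \frac{t_k - 1}{t_{k+1}}(\bar{\lambda}_{k+1} - \bar{\lambda}_k)$, which is the third step of (33).

Next, from the condition (31), we have $t_{k+1}^2 - t_{k+1} - t_k^2 = 0$. Hence, $t_{k+1} = \frac{1}{2}\left[1 + \sqrt{1 + 4t_k^2}\right]$,
which is exactly the second step of (33). □

C.1 The proof of Theorem 4.2: Convergence rate of the primal sequence

Proof. From Lemma A.2, we have

$$\begin{aligned} G(\bar{\lambda}_{k+1}) &\le \left[g(\hat{\lambda}_k) + \langle \nabla g(\hat{\lambda}_k), \lambda - \hat{\lambda}_k \rangle + h(\lambda)\right] + \frac{\tau_k\epsilon}{2} + M_k\langle \hat{\lambda}_k - \bar{\lambda}_{k+1}, \hat{\lambda}_k - \lambda \rangle - \frac{M_k}{2}\|\bar{\lambda}_{k+1} - \hat{\lambda}_k\|^2 && (34) \\ &\le G(\lambda) + \frac{\tau_k\epsilon}{2} + M_k\langle \hat{\lambda}_k - \bar{\lambda}_{k+1}, \hat{\lambda}_k - \lambda \rangle - \frac{M_k}{2}\|\bar{\lambda}_{k+1} - \hat{\lambda}_k\|^2. && (35) \end{aligned}$$

Note that these inequalities hold $\forall \lambda \in \mathbb{R}^n$. Next, we subtract $G^\star$ from (34) to get

$$\begin{aligned} G(\bar{\lambda}_{k+1}) - G^\star \le{} & \left[g(\hat{\lambda}_k) + \langle \nabla g(\hat{\lambda}_k), \lambda - \hat{\lambda}_k \rangle + h(\lambda) - G^\star\right] + \frac{\tau_k\epsilon}{2} \\ & + M_k\langle \hat{\lambda}_k - \bar{\lambda}_{k+1}, \hat{\lambda}_k - \lambda \rangle - \frac{M_k}{2}\|\bar{\lambda}_{k+1} - \hat{\lambda}_k\|^2, \end{aligned} \tag{36}$$

and we set $\lambda = \bar{\lambda}_k$ in (35), and then subtract $G^\star$ from both sides, which results in the following
inequality:

$$G(\bar{\lambda}_{k+1}) - G^\star \le G(\bar{\lambda}_k) - G^\star + \frac{\tau_k\epsilon}{2} - \frac{M_k}{2}\|\bar{\lambda}_{k+1} - \hat{\lambda}_k\|^2 + M_k\langle \hat{\lambda}_k - \bar{\lambda}_{k+1}, \hat{\lambda}_k - \bar{\lambda}_k \rangle. \tag{37}$$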

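We obtain the following estimate by summing the two inequalities that we get by multiplying (36)
by $\tau_k$ and (37) by $(1 - \tau_k)$, and then dividing the resulting estimate by $M_k\tau_k^2$:

$$\begin{aligned} \frac{1}{M_k\tau_k^2}\left[G(\bar{\lambda}_{k+1}) - G^\star\right] \le{} & \frac{(1 - \tau_k)}{M_k\tau_k^2}\left[G(\bar{\lambda}_k) - G^\star\right] + \frac{1}{2}\left[\|\tilde{\lambda}_k - \lambda\|^2 - \|\tilde{\lambda}_{k+1} - \lambda\|^2\right] + \frac{\epsilon}{2M_k\tau_k} \\ & + \frac{1}{M_k\tau_k}\left[g(\hat{\lambda}_k) + \langle \nabla g(\hat{\lambda}_k), \lambda - \hat{\lambda}_k \rangle + h(\lambda) - G^\star\right]. \end{aligned} \tag{38}$$
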


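Next, we sum this inequality over $k$ as follows:

$$\begin{aligned} \sum_{i=0}^{k}\frac{1}{M_i\tau_i^2}\left[G(\bar{\lambda}_{i+1}) - G^\star\right] \le{} & \sum_{i=0}^{k}\Big[\frac{(1 - \tau_i)}{M_i\tau_i^2}\left[G(\bar{\lambda}_i) - G^\star\right] + \frac{1}{2}\left[\|\tilde{\lambda}_i - \lambda\|^2 - \|\tilde{\lambda}_{i+1} - \lambda\|^2\right] + \frac{\epsilon}{2M_i\tau_i} \\ & \qquad + \frac{1}{M_i\tau_i}\left[g(\hat{\lambda}_i) + \langle \nabla g(\hat{\lambda}_i), \lambda - \hat{\lambda}_i \rangle + h(\lambda) - G^\star\right]\Big] \\ \le{} & \sum_{i=1}^{k}\frac{G(\bar{\lambda}_i) - G^\star}{M_{i-1}\tau_{i-1}^2} + \frac{1}{2}\left[\|\tilde{\lambda}_0 - \lambda\|^2 - \|\tilde{\lambda}_{k+1} - \lambda\|^2\right] + \sum_{i=0}^{k}\frac{\epsilon}{2M_i\tau_i} \\ & + \sum_{i=0}^{k}\frac{1}{M_i\tau_i}\left[g(\hat{\lambda}_i) + \langle \nabla g(\hat{\lambda}_i), \lambda - \hat{\lambda}_i \rangle + h(\lambda) - G^\star\right], \end{aligned}$$

where the second inequality follows from $\tau_0 = 1$ and $\frac{(1 - \tau_k)}{M_k\tau_k^2} \le \frac{1}{M_{k-1}\tau_{k-1}^2}$ for $k = 1, 2, \dots$, which holds
since $\frac{(1 - \tau_k)}{\tau_k^2} = \frac{1}{\tau_{k-1}^2}$ and $M_k \ge M_{k-1}$. With $S_k := \sum_{i=0}^{k}\frac{1}{M_i\tau_i}$, this implies the following:

$$\begin{aligned} & 0 \le \frac{G(\bar{\lambda}_{k+1}) - G^\star}{S_kM_k\tau_k^2} \le \frac{1}{S_k}\sum_{i=0}^{k}\frac{1}{M_i\tau_i}\left[g(\hat{\lambda}_i) + \langle \nabla g(\hat{\lambda}_i), \lambda - \hat{\lambda}_i \rangle + h(\lambda) - G^\star\right] \\ & \qquad\qquad + \frac{1}{2S_k}\left[\|\tilde{\lambda}_0 - \lambda\|^2 - \|\tilde{\lambda}_{k+1} - \lambda\|^2\right] + \frac{\epsilon}{2} \\ \Longrightarrow \ & -\frac{1}{S_k}\sum_{i=0}^{k}\frac{1}{M_i\tau_i}\left[g(\hat{\lambda}_i) + \langle \nabla g(\hat{\lambda}_i), \lambda - \hat{\lambda}_i \rangle + h(\lambda) - G^\star\right] \le \frac{1}{2S_k}\left[\|\tilde{\lambda}_0 - \lambda\|^2 - \|\tilde{\lambda}_{k+1} - \lambda\|^2\right] + \frac{\epsilon}{2}. \end{aligned}$$
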

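Now, we use the following expressions to map this estimate into the primal sequence:

$$\begin{aligned} g(\hat{\lambda}_i) &= -f(x^\star(\hat{\lambda}_i)) + \langle \hat{\lambda}_i, b - Ax^\star(\hat{\lambda}_i) \rangle, \\ \nabla g(\hat{\lambda}_i) &= b - Ax^\star(\hat{\lambda}_i), \\ G^\star &= -d^\star = -f^\star. \end{aligned}$$

Then, considering the convexity of $f$, we get

$$\begin{aligned} f(\bar{x}_k) - f^\star &\le \langle b - A\bar{x}_k, \lambda \rangle + h(\lambda) + \frac{\epsilon}{2} + \frac{1}{2S_k}\left[\|\tilde{\lambda}_0 - \lambda\|^2 - \|\tilde{\lambda}_{k+1} - \lambda\|^2\right] \\ &\le \frac{\epsilon}{2} + \frac{\|\tilde{\lambda}_0\|^2}{2S_k}, \end{aligned} \tag{39}$$

where we obtain the second inequality by setting $\lambda = 0^n$.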

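We can reformulate (31) as $\frac{1}{\tau_k} = \frac{1}{\tau_k^2} - \frac{1}{\tau_{k-1}^2}$. Using this relation, $M_0 \le M_i \le M_k \le 2M_{\epsilon\tau_k} =
2t_k^{\frac{1-\nu}{1+\nu}}M_\epsilon \le 2(k + 2)^{\frac{1-\nu}{1+\nu}}M_\epsilon$ for $i = 0, 1, \dots, k$, and $\frac{k+2}{2} \le t_k \le k + 2$, we can show that

$$\begin{aligned} S_k := \sum_{i=0}^{k}\frac{1}{M_i\tau_i} &\ge \sum_{i=0}^{k}\frac{1}{2M_{\epsilon\tau_k}\tau_i} = \frac{1}{2M_{\epsilon\tau_k}}\left[\frac{1}{\tau_0} + \sum_{i=1}^{k}\left(\frac{1}{\tau_i^2} - \frac{1}{\tau_{i-1}^2}\right)\right] \\ &= \frac{t_k^2}{2M_{\epsilon\tau_k}} \ge \frac{(k + 2)^2}{8(k + 2)^{\frac{1-\nu}{1+\nu}}M_\epsilon} = \frac{(k + 2)^{\frac{1+3\nu}{1+\nu}}}{8M_\epsilon}. \end{aligned} \tag{40}$$

We get the bound on the right hand side of (15) by substituting (40) into (39). The inequality on the
left hand side of (15) follows from the saddle point formulation (29) by setting $x = \bar{x}_k$.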
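
The two facts about $t_k$ that drive (40), namely $\sum_{i=0}^{k} t_i = t_k^2$ and $\frac{k+2}{2} \le t_k \le k + 1$, are easy to confirm numerically. The following sketch uses hypothetical helper names and also re-checks the final exponent algebra in (40) for a few values of $\nu$; the value of $M_\epsilon$ is an arbitrary placeholder.

```python
import math

def t_sequence(k):
    """Return [t_0, ..., t_k] with t_0 = 1 and t_{j+1} = (1 + sqrt(1 + 4 t_j^2)) / 2."""
    t = [1.0]
    for _ in range(k):
        t.append(0.5 * (1.0 + math.sqrt(1.0 + 4.0 * t[-1] ** 2)))
    return t

def s_k_lower_bound(k, nu, M_eps):
    """Right-hand side of (40): (k + 2)^((1 + 3 nu)/(1 + nu)) / (8 M_eps)."""
    return (k + 2) ** ((1.0 + 3.0 * nu) / (1.0 + nu)) / (8.0 * M_eps)

t = t_sequence(20)
k = 20
print(sum(t), t[k] ** 2)              # telescoping identity: the two values agree
print((k + 2) / 2, t[k], k + 1)       # lower bound, t_k, upper bound

# Exponent algebra in (40): ((k+2)/2)^2 / (2 (k+2)^((1-nu)/(1+nu)) M_eps)
M_eps = 3.0                            # arbitrary placeholder value
for nu in (0.0, 0.5, 1.0):
    lhs = ((k + 2) / 2.0) ** 2 / (2.0 * (k + 2) ** ((1.0 - nu) / (1.0 + nu)) * M_eps)
    print(nu, lhs, s_k_lower_bound(k, nu, M_eps))
```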


Finally, we prove the convergence rate in the feasibility gap. By the same arguments as in the
proof of Theorem 4.1, we have

$$\mathrm{dist}(A\bar{x}_k - b, \mathcal{K}) \le \frac{2\|\tilde{\lambda}_0 - \lambda^\star\|}{S_k} + \sqrt{\frac{\epsilon}{S_k}}.$$

We complete the proof by substituting (40) into this estimate. □

C.2 The worst-case complexity analysis



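We analyze the worst-case complexity of the AccUniPDGrad algorithm to achieve an $\epsilon$-solution. For
simplicity, we consider the case $\lambda_0 = 0^n$ without loss of generality. Then, we require

$$\frac{16M_\epsilon}{(k + 2)^{\frac{1+3\nu}{1+\nu}}}\|\lambda^\star\| + \sqrt{\frac{8M_\epsilon\epsilon}{(k + 2)^{\frac{1+3\nu}{1+\nu}}}} \le \frac{\epsilon}{\|\lambda^\star\|_{[1]}}$$

due to Theorem 4.2, where $\|\lambda^\star\|_{[1]} := \max\{\|\lambda^\star\|, 1\}$. By solving this inequality, we get

$$k + 2 \ge \left[\frac{8\sqrt{2}\|\lambda^\star\|}{-1 + \sqrt{1 + 8\frac{\|\lambda^\star\|}{\|\lambda^\star\|_{[1]}}}}\right]^{\frac{2+2\nu}{1+3\nu}}\left(\frac{M_\epsilon}{\epsilon}\right)^{\frac{1+\nu}{1+3\nu}}.$$

Using the definition (10) of $M_\epsilon$ and considering the fact that $\left[\frac{1-\nu}{1+\nu}\right]^{\frac{1-\nu}{1+3\nu}} \le 1$ for $\nu \in [0, 1]$, we find
the maximum number of iterations that satisfies the above inequality as follows:

$$k_{\max} = \left\lfloor \left[\frac{8\sqrt{2}\|\lambda^\star\|}{-1 + \sqrt{1 + 8\frac{\|\lambda^\star\|}{\|\lambda^\star\|_{[1]}}}}\right]^{\frac{2+2\nu}{1+3\nu}} \inf_{0 \le \nu \le 1}\left(\frac{M_\nu}{\epsilon}\right)^{\frac{2}{1+3\nu}} \right\rfloor,$$

which is indeed (17).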

Hence, the worst-case complexity to obtain an $\epsilon$-solution of (1) in the sense of Definition 1.1 is

$$\mathcal{O}\left(\inf_{0 \le \nu \le 1}\left(\frac{M_\nu}{\epsilon}\right)^{\frac{2}{1+3\nu}}\right),$$

which is optimal in the sense of first-order black box models.

Next, we consider the number of oracle queries in AccUniPDGrad. At iteration $k$, the algorithm
requires $i_k + 2$ function evaluations of $g$, as we need $i_k + 1$ in the line-search and one for $g(\hat{\lambda}_k)$.
Hence, the total number of oracle queries up to the iteration $k$ is $N_2(k) = \sum_{j=0}^{k}(i_j + 2)$. Since
$i_j = \log_2(M_j / M_{j-1})$, we have

$$N_2(k) = 2(k + 1) + \log_2(M_k) - \log_2(M_{-1}).$$

Using the same argument as in the proof of Lemma A.3, we have $M_k \le 2\left[\frac{k+1}{\epsilon}\right]^{\frac{1-\nu}{1+\nu}}M_\nu^{\frac{2}{1+\nu}}$.
Hence, we obtain (18) as

$$N_2(k) \le 2(k + 1) + 1 + \frac{1-\nu}{1+\nu}\left[\log_2(k + 1) - \log_2(\epsilon)\right] + \frac{2}{1+\nu}\log_2(M_\nu) - \log_2(M_{-1}).$$


D The implementation details 

In this section, we specify key steps of UniPDGrad and AccUniPDGrad for two important
applications that we used in Section 5. We also provide an analytic step-size that guarantees the line-search
condition without function evaluation.

We performed the experiments in MATLAB, using a computational resource with 4 CPUs of 2.40 
GHz and 16 GB memory space for the matrix completion, and 16 CPUs of 2.40 GHz and 512 GB 
memory space for the quantum tomography problem. 

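D.1 Constrained convex optimization involving a quadratic cost

In both quantum tomography and the matrix completion problems, we consider some problem
formulations from the following convex optimization template that involves a quadratic cost:

$$\min_{x \in \mathbb{R}^p}\left\{\frac{1}{2}\|\mathcal{A}(x) - b\|^2 : x \in \mathcal{X}\right\}.$$

For notational simplicity, we consider the problem in $\mathbb{R}^p$/$\mathbb{R}^n$ spaces in this section, but the ideas
apply in general.

Evaluation of the sharp-operator corresponding to the objective function $\frac{1}{2}\|\mathcal{A}(x) - b\|^2$ requires a
significant computational effort. Yet, by introducing the slack variable $r = \mathcal{A}(x) - b$, we can write
an equivalent problem as

$$\min_{(r, x) \in \mathbb{R}^n \times \mathbb{R}^p}\left\{\frac{1}{2}\|r\|^2 : \mathcal{A}(x) - r = b, \ x \in \mathcal{X}\right\}.$$

We can write the Lagrange function associated with the linear constraint as

$$\mathcal{L}(r, x, \lambda) = \frac{1}{2}\|r\|^2 + \langle \lambda, r - \mathcal{A}(x) + b \rangle,$$

from which we can derive the (negation of the) dual function

$$\begin{aligned} g(\lambda) = -\min_{r \in \mathbb{R}^n, x \in \mathcal{X}}\mathcal{L}(r, x, \lambda) &= -\min_{r \in \mathbb{R}^n}\left\{\frac{1}{2}\|r\|^2 + \langle \lambda, r \rangle\right\} + \max_{x \in \mathcal{X}}\langle \lambda, \mathcal{A}(x) \rangle - \langle \lambda, b \rangle \\ &= \frac{1}{2}\|\lambda\|^2 + \langle \lambda, \mathcal{A}(x^\star(\lambda)) - b \rangle, \end{aligned} \tag{41}$$

and its subgradient

$$\nabla g(\lambda) = \lambda - b + \mathcal{A}(x^\star(\lambda)),$$

where $x^\star(\lambda) \in [\mathcal{A}^T(\lambda)]^\sharp_{\mathcal{X}} = \arg\max_{x \in \mathcal{X}}\langle \mathcal{A}^T(\lambda), x \rangle$.

For the special case where $\mathcal{X}$ is a norm ball, i.e., $\mathcal{X} = \{x : \|x\| \le \kappa\}$, we can simplify (41) as follows:

$$g(\lambda) = \frac{1}{2}\|\lambda\|^2 - \langle \lambda, b \rangle + \kappa\|\mathcal{A}^T(\lambda)\|. \tag{42}$$

Computing an analytical step-size: Now, we consider the line-search procedure in UniPDGrad
and AccUniPDGrad. Since the $h(\lambda)$ term is absent in these problems, the line-search condition (22)
can be simplified as

$$g(\bar{\lambda}_{k+1}) = g(\hat{\lambda}_k - \alpha_k\nabla g(\hat{\lambda}_k)) \le g(\hat{\lambda}_k) - \frac{\alpha_k}{2}\|\nabla g(\hat{\lambda}_k)\|^2 + \frac{\delta_k}{2}, \tag{43}$$

where we use the notational convention $\hat{\lambda}_k = \lambda_k$ and $\delta_k = \epsilon$ for UniPDGrad, and $\delta_k = \epsilon / t_k$ for
AccUniPDGrad. Using the definition (42), we can upper bound $g(\hat{\lambda}_k - \alpha_k\nabla g(\hat{\lambda}_k))$ by

$$U(\alpha_k) := g(\hat{\lambda}_k) + \frac{\alpha_k^2}{2}\|\nabla g(\hat{\lambda}_k)\|^2 - \alpha_k\langle \hat{\lambda}_k - b, \nabla g(\hat{\lambda}_k) \rangle + \alpha_k\kappa\|\mathcal{A}^T(\nabla g(\hat{\lambda}_k))\|.$$

The condition (43) holds if $U(\alpha_k) = g(\hat{\lambda}_k) - \frac{\alpha_k}{2}\|\nabla g(\hat{\lambda}_k)\|^2 + \frac{\delta_k}{2}$. Solving this second order
equation, we obtain $\alpha_k$ explicitly as

$$\alpha_k = \frac{-P + \sqrt{P^2 + 4\delta_k\|\nabla g(\hat{\lambda}_k)\|^2}}{2\|\nabla g(\hat{\lambda}_k)\|^2},$$

where $P := \|\nabla g(\hat{\lambda}_k)\|^2 + 2\kappa\|\mathcal{A}^T(\nabla g(\hat{\lambda}_k))\| - 2\langle \hat{\lambda}_k - b, \nabla g(\hat{\lambda}_k) \rangle$. Note that we can use this
method to find a good estimate for the initial smoothness constant $M_{-1}$ in the initialization step.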
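
The closed-form step size can be sanity-checked on a small instance. The sketch below is an illustration under assumptions that are not from the original: $\mathcal{A}$ is a dense matrix stored as a list of rows, $\mathcal{X}$ is a Euclidean ball of radius $\kappa$, and the dual function is $g(\lambda) = \frac{1}{2}\|\lambda\|^2 - \langle \lambda, b \rangle + \kappa\|\mathcal{A}^T\lambda\|_2$, which is the sign convention under which the bound $U(\alpha_k)$ above holds. All function names are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def matvec(A, x):
    """A x for a matrix stored as a list of rows."""
    return [dot(row, x) for row in A]

def rmatvec(A, y):
    """A^T y for a matrix stored as a list of rows."""
    return [sum(A[i][j] * y[i] for i in range(len(A))) for j in range(len(A[0]))]

def g(lam, A, b, kappa):
    """Dual function for an l2-ball constraint (assumed form, see lead-in)."""
    return 0.5 * dot(lam, lam) - dot(lam, b) + kappa * norm(rmatvec(A, lam))

def grad_g(lam, A, b, kappa):
    """(Sub)gradient lam - b + A x*(lam), with x*(lam) = kappa A^T lam / ||A^T lam||."""
    w = rmatvec(A, lam)
    nw = norm(w)
    x_star = [kappa * wi / nw for wi in w] if nw > 0 else [0.0] * len(w)
    return [l - bi + ax for l, bi, ax in zip(lam, b, matvec(A, x_star))]

def analytic_step(lam, d, A, b, kappa, delta):
    """Positive root of alpha^2 ||d||^2 + alpha P - delta = 0 for direction d."""
    D = dot(d, d)
    if D == 0.0:
        return 0.0  # stationary point: no step needed
    shifted = [l - bi for l, bi in zip(lam, b)]
    P = D + 2.0 * kappa * norm(rmatvec(A, d)) - 2.0 * dot(shifted, d)
    return (-P + math.sqrt(P * P + 4.0 * delta * D)) / (2.0 * D)

A = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]]
b = [1.0, -2.0, 0.5]
lam = [0.3, -0.7, 1.1]
kappa, delta = 2.0, 0.1
d = grad_g(lam, A, b, kappa)
alpha = analytic_step(lam, d, A, b, kappa, delta)
lam_new = [l - alpha * di for l, di in zip(lam, d)]
print(alpha, g(lam_new, A, b, kappa), g(lam, A, b, kappa) - 0.5 * alpha * dot(d, d) + 0.5 * delta)
```

The second printed value does not exceed the third, which is condition (43); no extra evaluation of $g$ is needed to choose $\alpha_k$, and $1/\alpha_k$ can serve as the smoothness estimate mentioned above.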

D.2 Constrained convex optimization involving a norm cost 

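Now, we consider the second application, which is reformulated as

$$\min_{X \in \mathbb{R}^{p \times l}}\left\{\psi(X) = \frac{1}{n}\|X\|_\star^2 : \mathcal{A}(X) - b \in \mathcal{K}\right\},$$

where $\mathcal{K}$ is an $\ell_2$-norm ball, i.e., $\mathcal{K} := \{r : \|r\| \le \kappa\}$. Once again, by introducing the slack variable
$r = \mathcal{A}(X) - b$, we get

$$\min_{X \in \mathbb{R}^{p \times l}, \, r \in \mathbb{R}^n}\left\{\frac{1}{n}\|X\|_\star^2 : \mathcal{A}(X) - r = b, \ r \in \mathcal{K}\right\}.$$

Clearly, the dual components $g$ and $h$ defined in (6) can be expressed as:

$$\begin{aligned} g(\lambda) &= \max_{X \in \mathbb{R}^{p \times l}}\left\{\langle \mathcal{A}^T(\lambda), X \rangle - \frac{1}{n}\|X\|_\star^2\right\} + \langle b, \lambda \rangle = \frac{n}{4}\|\mathcal{A}^T(\lambda)\|^2 + \langle b, \lambda \rangle, \\ h(\lambda) &= \max_{r \in \mathcal{K}}\langle -\lambda, r \rangle = \max_{\|r\| \le \kappa}\langle -\lambda, r \rangle = \kappa\|\lambda\|, \end{aligned}$$

where $\|\cdot\|$ represents the Euclidean norm for vectors and the spectral norm for matrices. In (21), we
consider a special case where $\mathcal{K} = \{0^n\}$, hence $h(\lambda) = 0$.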

Clearly, $X^\star(\lambda) = \sigma_1 e_1 e_1^T \in [\mathcal{A}^T(\lambda)]^\sharp_\psi$, where $\sigma_1 = \|\mathcal{A}^T(\lambda)\|$ is the top singular value of $\mathcal{A}^T(\lambda)$
and $e_1$ is the associated left singular vector. Hence, we can write the (sub)gradient of $g$ as

$$\nabla g(\lambda) = b - \mathcal{A}(X^\star(\lambda)).$$

We can compute both $\sigma_1$ and $e_1$ efficiently by using the power method or the Lanczos algorithm.
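
A minimal sketch of the power method for this purpose is given below. It is not the authors' MATLAB implementation: it assumes the matrix is small and dense (a list of rows), uses a fixed iteration count instead of a stopping test, and starts from a deterministic vector that is assumed not to be orthogonal to the top left singular vector. The function name is hypothetical.

```python
import math

def top_singular_pair(B, iters=200):
    """Power iteration on B B^T; returns (sigma_1, top left singular vector)."""
    m, n = len(B), len(B[0])
    u = [1.0 / math.sqrt(m)] * m  # deterministic unit-norm start (assumption)
    for _ in range(iters):
        v = [sum(B[i][j] * u[i] for i in range(m)) for j in range(n)]  # B^T u
        w = [sum(B[i][j] * v[j] for j in range(n)) for i in range(m)]  # B (B^T u)
        nw = math.sqrt(sum(x * x for x in w))
        if nw == 0.0:
            return 0.0, u  # u lies in the null space of B^T
        u = [x / nw for x in w]
    v = [sum(B[i][j] * u[i] for i in range(m)) for j in range(n)]
    sigma = math.sqrt(sum(x * x for x in v))  # ||B^T u|| equals sigma_1 at convergence
    return sigma, u

sigma, u = top_singular_pair([[0.0, 2.0], [1.0, 0.0], [0.0, 0.0]])
print(sigma, u)
```

Each iteration costs one product with the matrix and one with its transpose, which is why this operation is much cheaper than a full singular value decomposition; the Lanczos algorithm typically needs fewer such products for the same accuracy.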